Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix with deduplication on, but it's also useful in practice without dedup.

PATCH 2 and 3 target the delayed refs' scalability problems, which are uncovered by the dedup feature.

PATCH 4 is a speed-up improvement regarding dedup and quota.

PATCH 5-8 are the preparation for the dedup implementation.

PATCH 9 shows how we implement the dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 fixes a race bug on dedup writes.

PATCH 13 adds the ioctl to control the dedup feature.

And there is also a btrfs-progs patch (PATCH 14) which covers all the details of how to control the dedup feature.

I've tested this with xfstests by adding an inline dedup 'enable & on' in xfstests' mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use the ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (13):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quota_enabled is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: fix a crash of dedup ref
  Btrfs: add ioctl of dedup control

 fs/btrfs/backref.c           |    9 +
 fs/btrfs/ctree.c             |    2 +-
 fs/btrfs/ctree.h             |   85 ++++++
 fs/btrfs/delayed-ref.c       |  159 +++++++----
 fs/btrfs/delayed-ref.h       |    8 +
 fs/btrfs/disk-io.c           |   45 +++
 fs/btrfs/extent-tree.c       |  190 +++++++++++--
 fs/btrfs/extent_io.c         |   22 ++-
 fs/btrfs/extent_io.h         |   15 +
 fs/btrfs/file-item.c         |  230 +++++++++++++++
 fs/btrfs/inode.c             |  641 +++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ioctl.c             |  167 +++++++++++
 fs/btrfs/ordered-data.c      |   38 ++-
 fs/btrfs/ordered-data.h      |   13 +-
 fs/btrfs/qgroup.c            |    3 +
 fs/btrfs/relocation.c        |    3 +
 fs/btrfs/transaction.c       |    4 +-
 include/trace/events/btrfs.h |    3 +-
 include/uapi/linux/btrfs.h   |   11 +
 19 files changed, 1478 insertions(+), 170 deletions(-)

-- 
1.7.7
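For readers new to the feature, here is a minimal, self-contained userspace sketch of the write-path decision the cover letter describes: hash each dedup-blocksize chunk, look the hash up, and either add a reference to the extent that already holds that data or write a new extent and remember its hash. It is only an illustration; the names (dedup_entry, lookup_hash, write_block) are made up, FNV-1a stands in for the SHA-256 the series actually uses, and the in-memory array stands in for the on-disk dedup tree.

/*
 * Toy model of inband dedup: split data into fixed-size blocks, hash each
 * block, and either reference an existing "extent" with the same hash or
 * allocate a new one.  Everything here is a hypothetical stand-in, kept
 * self-contained so it compiles and runs on its own.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_BS  16                      /* toy dedup blocksize */
#define TABLE_MAX 64

struct dedup_entry { uint64_t hash; uint64_t bytenr; };

static struct dedup_entry table[TABLE_MAX];
static int table_len;
static uint64_t next_bytenr = 4096;       /* pretend allocator cursor */

static uint64_t hash_block(const uint8_t *p, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a, stand-in for SHA-256 */
	while (len--)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

static struct dedup_entry *lookup_hash(uint64_t h)
{
	for (int i = 0; i < table_len; i++)
		if (table[i].hash == h)
			return &table[i];
	return NULL;
}

static void write_block(const uint8_t *blk)
{
	uint64_t h = hash_block(blk, DEDUP_BS);
	struct dedup_entry *e = lookup_hash(h);

	if (e) {                              /* duplicate: just add a reference */
		printf("dedup hit : reuse extent at bytenr %llu\n",
		       (unsigned long long)e->bytenr);
		return;
	}
	/* miss: "write" the data and record its hash for future lookups */
	printf("dedup miss: new extent at bytenr %llu\n",
	       (unsigned long long)next_bytenr);
	if (table_len < TABLE_MAX)
		table[table_len++] = (struct dedup_entry){ h, next_bytenr };
	next_bytenr += DEDUP_BS;
}

int main(void)
{
	uint8_t a[DEDUP_BS], b[DEDUP_BS];

	memset(a, 'A', sizeof(a));
	memset(b, 'B', sizeof(b));
	write_block(a);                       /* miss */
	write_block(b);                       /* miss */
	write_block(a);                       /* hit, reuses the first extent */
	return 0;
}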
Liu Bo
2013-Oct-14 04:59 UTC
[PATCH v7 01/13] Btrfs: skip merge part for delayed data refs
When data deduplication is on, we'll hang in the merge part because it needs
to verify every queued delayed data ref related to this disk offset, and we
may have millions of such refs.  And in the case of delayed data refs, we
don't usually have too many data refs to merge.

So it's safe to shut it down for data refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/delayed-ref.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index e4d467b..b0d5d79 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -320,6 +320,13 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 	struct rb_node *node;
 	u64 seq = 0;
 
+	/*
+	 * We don't have too many refs to merge in the case of delayed data
+	 * refs.
+	 */
+	if (head->is_data)
+		return;
+
 	spin_lock(&fs_info->tree_mod_seq_lock);
 	if (!list_empty(&fs_info->tree_mod_seq_list)) {
 		struct seq_list *elem;

-- 
1.7.7
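For context, the hang comes from the merge pass having to check each queued delayed data ref against the others queued for the same disk offset; with dedup, one popular extent can accumulate millions of queued refs, so the work grows roughly quadratically in the worst case. A back-of-the-envelope model (plain C, not btrfs code) of why the early return matters:

/* Rough cost model: pairwise merge checks over n queued refs are about
 * n*(n-1)/2 comparisons.  Metadata ref lists stay short; dedup'ed data
 * refs can reach millions, which is why the patch skips them entirely. */
#include <stdio.h>

static unsigned long long merge_comparisons(unsigned long long n)
{
	return n * (n - 1) / 2;
}

int main(void)
{
	printf("      64 refs -> %llu comparisons\n", merge_comparisons(64));
	printf(" 1000000 refs -> %llu comparisons\n", merge_comparisons(1000000));
	return 0;
}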
Liu Bo
2013-Oct-14 04:59 UTC
[PATCH v7 02/13] Btrfs: improve the delayed refs process in rm case
While removing a file with dedup extents, we could have a great number of
delayed refs pending to process, and these refs refer to dropping a ref of
the extent, i.e. they are of the BTRFS_DROP_DELAYED_REF type.

But in order to prevent an extent's ref count from going down to zero while
there are still pending delayed refs, we first select the "adding a ref"
ones, which are of the BTRFS_ADD_DELAYED_REF type.

So in the removing case, all of our delayed refs are of the
BTRFS_DROP_DELAYED_REF type, yet we still walk all the refs issued to the
extent looking for BTRFS_ADD_DELAYED_REF ones, find that there are none, and
then start over again to find the BTRFS_DROP_DELAYED_REF ones.

This is really unnecessary; we can improve it by tracking how many
BTRFS_ADD_DELAYED_REF refs we have, and searching directly for the right
type.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/delayed-ref.c |   10 ++++++++++
 fs/btrfs/delayed-ref.h |    3 +++
 fs/btrfs/extent-tree.c |   17 ++++++++++++++++-
 3 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b0d5d79..9596649 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -543,6 +543,10 @@ update_existing_head_ref(struct btrfs_delayed_ref_node *existing,
 	 * update the reference mod on the head to reflect this new operation
 	 */
 	existing->ref_mod += update->ref_mod;
+
+	WARN_ON_ONCE(update->ref_mod > 1);
+	if (update->ref_mod == 1)
+		existing_ref->add_cnt++;
 }
 
 /*
@@ -604,6 +608,12 @@ static noinline void add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->is_data = is_data;
 
+	/* track added ref, more comments in select_delayed_ref() */
+	if (count_mod == 1)
+		head_ref->add_cnt = 1;
+	else
+		head_ref->add_cnt = 0;
+
 	INIT_LIST_HEAD(&head_ref->cluster);
 	mutex_init(&head_ref->mutex);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 70b962c..9377b27 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -84,6 +84,9 @@ struct btrfs_delayed_ref_head {
 	struct list_head cluster;
 
 	struct btrfs_delayed_extent_op *extent_op;
+
+	int add_cnt;
+
 	/*
 	 * when a new extent is allocated, it is just reserved in memory
 	 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d58bef1..f5bf1a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2280,6 +2280,16 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
 	struct rb_node *node;
 	struct btrfs_delayed_ref_node *ref;
 	int action = BTRFS_ADD_DELAYED_REF;
+
+	/*
+	 * track the count of BTRFS_ADD_DELAYED_REF,
+	 * in the case that there's no BTRFS_ADD_DELAYED_REF while there's a
+	 * great number of BTRFS_DROP_DELAYED_REF,
+	 * it'll waste time on searching BTRFS_ADD_DELAYED_REF, usually this
+	 * happens with dedup enabled.
+	 */
+	if (head->add_cnt == 0)
+		action = BTRFS_DROP_DELAYED_REF;
 again:
 	/*
 	 * select delayed ref of type BTRFS_ADD_DELAYED_REF first.
@@ -2294,8 +2304,11 @@ again:
 				rb_node);
 		if (ref->bytenr != head->node.bytenr)
 			break;
-		if (ref->action == action)
+		if (ref->action == action) {
+			if (action == BTRFS_ADD_DELAYED_REF)
+				head->add_cnt--;
 			return ref;
+		}
 		node = rb_prev(node);
 	}
 	if (action == BTRFS_ADD_DELAYED_REF) {
@@ -2371,6 +2384,8 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
 			 * there are still refs with lower seq numbers in the
 			 * process of being added. Don't run this ref yet.
 			 */
+			if (ref->action == BTRFS_ADD_DELAYED_REF)
+				locked_ref->add_cnt++;
 			list_del_init(&locked_ref->cluster);
 			btrfs_delayed_ref_unlock(locked_ref);
 			locked_ref = NULL;

-- 
1.7.7
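A condensed userspace model of the bookkeeping this patch introduces: the head ref keeps a count of queued BTRFS_ADD_DELAYED_REF entries, and when that count is zero (the common case while removing files with dedup extents) the selection step searches for DROP refs directly instead of scanning for ADD refs first. Struct and function names below are illustrative, not the btrfs ones.

#include <stdio.h>

enum ref_action { ADD_REF, DROP_REF };

struct ref { enum ref_action action; };

struct head {
	struct ref refs[8];
	int nr;
	int add_cnt;                   /* how many ADD refs are queued */
};

/* Pick the action to search for: prefer ADD, but skip straight to DROP
 * when the counter says there are no ADD refs queued at all. */
static enum ref_action select_action(struct head *h)
{
	return h->add_cnt ? ADD_REF : DROP_REF;
}

static void queue_ref(struct head *h, enum ref_action a)
{
	h->refs[h->nr++].action = a;
	if (a == ADD_REF)
		h->add_cnt++;
}

int main(void)
{
	struct head h = { .nr = 0, .add_cnt = 0 };

	queue_ref(&h, DROP_REF);
	queue_ref(&h, DROP_REF);
	/* no ADD refs queued, so the search goes straight for DROP refs */
	printf("search for %s refs first\n",
	       select_action(&h) == ADD_REF ? "ADD" : "DROP");
	return 0;
}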
The way how we process delayed refs is 1) get a bunch of head refs, 2) pick up one head ref, 3) go one node back for any delayed ref updates. The head ref is also linked in the same rbtree as the delayed ref is, so in 1) stage, we have to walk one by one including not only head refs, but delayed refs. When we have a great number of delayed refs pending to process, this''ll cost time a lot. Here we introduce a head ref specific rbtree, it only has head refs, so troubles go away. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/delayed-ref.c | 124 ++++++++++++++++++++++++++++-------------------- fs/btrfs/delayed-ref.h | 5 ++ fs/btrfs/disk-io.c | 3 + fs/btrfs/extent-tree.c | 21 +++++--- fs/btrfs/transaction.c | 4 +- 5 files changed, 98 insertions(+), 59 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 9596649..9e1a1c9 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -161,35 +161,61 @@ static struct btrfs_delayed_ref_node *tree_insert(struct rb_root *root, return NULL; } +/* insert a new ref to head ref rbtree */ +static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root, + struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent_node = NULL; + struct btrfs_delayed_ref_head *entry; + struct btrfs_delayed_ref_head *ins; + u64 bytenr; + + ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node); + bytenr = ins->node.bytenr; + while (*p) { + parent_node = *p; + entry = rb_entry(parent_node, struct btrfs_delayed_ref_head, + href_node); + + if (bytenr < entry->node.bytenr) + p = &(*p)->rb_left; + else if (bytenr > entry->node.bytenr) + p = &(*p)->rb_right; + else + return entry; + } + + rb_link_node(node, parent_node, p); + rb_insert_color(node, root); + return NULL; +} + /* * find an head entry based on bytenr. This returns the delayed ref * head if it was able to find one, or NULL if nothing was in that spot. * If return_bigger is given, the next bigger entry is returned if no exact * match is found. 
*/ -static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root, - u64 bytenr, - struct btrfs_delayed_ref_node **last, - int return_bigger) +static struct btrfs_delayed_ref_head * +find_ref_head(struct rb_root *root, u64 bytenr, + struct btrfs_delayed_ref_head **last, int return_bigger) { struct rb_node *n; - struct btrfs_delayed_ref_node *entry; + struct btrfs_delayed_ref_head *entry; int cmp = 0; again: n = root->rb_node; entry = NULL; while (n) { - entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node); - WARN_ON(!entry->in_tree); + entry = rb_entry(n, struct btrfs_delayed_ref_head, href_node); if (last) *last = entry; - if (bytenr < entry->bytenr) + if (bytenr < entry->node.bytenr) cmp = -1; - else if (bytenr > entry->bytenr) - cmp = 1; - else if (!btrfs_delayed_ref_is_head(entry)) + else if (bytenr > entry->node.bytenr) cmp = 1; else cmp = 0; @@ -203,12 +229,12 @@ again: } if (entry && return_bigger) { if (cmp > 0) { - n = rb_next(&entry->rb_node); + n = rb_next(&entry->href_node); if (!n) n = rb_first(root); - entry = rb_entry(n, struct btrfs_delayed_ref_node, - rb_node); - bytenr = entry->bytenr; + entry = rb_entry(n, struct btrfs_delayed_ref_head, + href_node); + bytenr = entry->node.bytenr; return_bigger = 0; goto again; } @@ -246,6 +272,12 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_node *ref) { rb_erase(&ref->rb_node, &delayed_refs->root); + if (btrfs_delayed_ref_is_head(ref)) { + struct btrfs_delayed_ref_head *head; + + head = btrfs_delayed_node_to_head(ref); + rb_erase(&head->href_node, &delayed_refs->href_root); + } ref->in_tree = 0; btrfs_put_delayed_ref(ref); delayed_refs->num_entries--; @@ -386,42 +418,35 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans, int count = 0; struct btrfs_delayed_ref_root *delayed_refs; struct rb_node *node; - struct btrfs_delayed_ref_node *ref; - struct btrfs_delayed_ref_head *head; + struct btrfs_delayed_ref_head *head = NULL; delayed_refs = &trans->transaction->delayed_refs; - if (start == 0) { - node = rb_first(&delayed_refs->root); - } else { - ref = NULL; - find_ref_head(&delayed_refs->root, start + 1, &ref, 1); - if (ref) { - node = &ref->rb_node; - } else - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); + + if (start) { + find_ref_head(&delayed_refs->href_root, start + 1, &head, 1); + if (head) + node = &head->href_node; } again: while (node && count < 32) { - ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node); - if (btrfs_delayed_ref_is_head(ref)) { - head = btrfs_delayed_node_to_head(ref); - if (list_empty(&head->cluster)) { - list_add_tail(&head->cluster, cluster); - delayed_refs->run_delayed_start - head->node.bytenr; - count++; + head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); + if (list_empty(&head->cluster)) { + list_add_tail(&head->cluster, cluster); + delayed_refs->run_delayed_start + head->node.bytenr; + count++; - WARN_ON(delayed_refs->num_heads_ready == 0); - delayed_refs->num_heads_ready--; - } else if (count) { - /* the goal of the clustering is to find extents - * that are likely to end up in the same extent - * leaf on disk. So, we don''t want them spread - * all over the tree. Stop now if we''ve hit - * a head that was already in use - */ - break; - } + WARN_ON(delayed_refs->num_heads_ready == 0); + delayed_refs->num_heads_ready--; + } else if (count) { + /* the goal of the clustering is to find extents + * that are likely to end up in the same extent + * leaf on disk. 
So, we don''t want them spread + * all over the tree. Stop now if we''ve hit + * a head that was already in use + */ + break; } node = rb_next(node); } @@ -433,7 +458,7 @@ again: * clusters. start from the beginning and try again */ start = 0; - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); goto again; } return 1; @@ -629,6 +654,7 @@ static noinline void add_delayed_ref_head(struct btrfs_fs_info *fs_info, */ kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref); } else { + htree_insert(&delayed_refs->href_root, &head_ref->href_node); delayed_refs->num_heads++; delayed_refs->num_heads_ready++; delayed_refs->num_entries++; @@ -886,14 +912,10 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_head * btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr) { - struct btrfs_delayed_ref_node *ref; struct btrfs_delayed_ref_root *delayed_refs; delayed_refs = &trans->transaction->delayed_refs; - ref = find_ref_head(&delayed_refs->root, bytenr, NULL, 0); - if (ref) - return btrfs_delayed_node_to_head(ref); - return NULL; + return find_ref_head(&delayed_refs->href_root, bytenr, NULL, 0); } void btrfs_delayed_ref_exit(void) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 9377b27..6a0295b 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -83,6 +83,8 @@ struct btrfs_delayed_ref_head { struct list_head cluster; + struct rb_node href_node; + struct btrfs_delayed_extent_op *extent_op; int add_cnt; @@ -121,6 +123,9 @@ struct btrfs_delayed_data_ref { struct btrfs_delayed_ref_root { struct rb_root root; + /* head ref rbtree */ + struct rb_root href_root; + /* this spin lock protects the rbtree and the entries inside */ spinlock_t lock; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 4ae17ed..289c09b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3860,6 +3860,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, ref->in_tree = 0; rb_erase(&ref->rb_node, &delayed_refs->root); + if (head) + rb_erase(&head->href_node, &delayed_refs->href_root); + delayed_refs->num_entries--; spin_unlock(&delayed_refs->lock); if (head) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f5bf1a8..6d92c54 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2438,6 +2438,10 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans, ref->in_tree = 0; rb_erase(&ref->rb_node, &delayed_refs->root); + if (btrfs_delayed_ref_is_head(ref)) { + rb_erase(&locked_ref->href_node, + &delayed_refs->href_root); + } delayed_refs->num_entries--; if (!btrfs_delayed_ref_is_head(ref)) { /* @@ -2640,7 +2644,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, { struct rb_node *node; struct btrfs_delayed_ref_root *delayed_refs; - struct btrfs_delayed_ref_node *ref; + struct btrfs_delayed_ref_head *head; struct list_head cluster; int ret; u64 delayed_start; @@ -2770,18 +2774,18 @@ again: spin_lock(&delayed_refs->lock); } - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); if (!node) goto out; count = (unsigned long)-1; while (node) { - ref = rb_entry(node, struct btrfs_delayed_ref_node, - rb_node); - if (btrfs_delayed_ref_is_head(ref)) { - struct btrfs_delayed_ref_head *head; + head = rb_entry(node, struct btrfs_delayed_ref_head, + href_node); + if (btrfs_delayed_ref_is_head(&head->node)) { + struct btrfs_delayed_ref_node *ref; - head = btrfs_delayed_node_to_head(ref); + ref = 
&head->node; atomic_inc(&ref->refs); spin_unlock(&delayed_refs->lock); @@ -2795,6 +2799,8 @@ again: btrfs_put_delayed_ref(ref); cond_resched(); goto again; + } else { + WARN_ON(1); } node = rb_next(node); } @@ -5917,6 +5923,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, */ head->node.in_tree = 0; rb_erase(&head->node.rb_node, &delayed_refs->root); + rb_erase(&head->href_node, &delayed_refs->href_root); delayed_refs->num_entries--; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 8c81bdc..b9e0a46 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -62,7 +62,8 @@ static void put_transaction(struct btrfs_transaction *transaction) WARN_ON(atomic_read(&transaction->use_count) == 0); if (atomic_dec_and_test(&transaction->use_count)) { BUG_ON(!list_empty(&transaction->list)); - WARN_ON(transaction->delayed_refs.root.rb_node); + WARN_ON(!RB_EMPTY_ROOT(&transaction->delayed_refs.root)); + WARN_ON(!RB_EMPTY_ROOT(&transaction->delayed_refs.href_root)); while (!list_empty(&transaction->pending_chunks)) { struct extent_map *em; @@ -184,6 +185,7 @@ loop: cur_trans->start_time = get_seconds(); cur_trans->delayed_refs.root = RB_ROOT; + cur_trans->delayed_refs.href_root = RB_ROOT; cur_trans->delayed_refs.num_entries = 0; cur_trans->delayed_refs.num_heads_ready = 0; cur_trans->delayed_refs.num_heads = 0; -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
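With this patch a head ref is linked into two rbtrees at once: rb_node keeps it in the tree shared with ordinary delayed refs, while the new href_node puts it into a heads-only tree, so walking heads no longer has to step over every plain ref. What makes the double membership cheap is that rb_entry() is simply container_of(), so either embedded node recovers the same containing struct. A standalone illustration of that idea (not the btrfs structures themselves):

#include <stddef.h>
#include <stdio.h>

/* container_of: recover the outer struct from a pointer to one of its
 * embedded members - this is all rb_entry() does in the kernel. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct node { struct node *left, *right; };   /* stand-in for rb_node */

struct ref_head {
	long bytenr;
	struct node rb_node;    /* linked in the tree of all delayed refs */
	struct node href_node;  /* linked in the heads-only tree */
};

int main(void)
{
	struct ref_head head = { .bytenr = 12345 };
	struct node *n1 = &head.rb_node;
	struct node *n2 = &head.href_node;

	/* Either embedded node gets us back to the same head. */
	printf("%ld\n", container_of(n1, struct ref_head, rb_node)->bytenr);
	printf("%ld\n", container_of(n2, struct ref_head, href_node)->bytenr);
	return 0;
}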
Liu Bo
2013-Oct-14 04:59 UTC
[PATCH v7 04/13] Btrfs: disable qgroups accounting when quota_enabled is 0
It's unnecessary to do qgroups accounting when quota is not enabled.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/ctree.c       |    2 +-
 fs/btrfs/delayed-ref.c |   18 ++++++++++++++----
 fs/btrfs/qgroup.c      |    3 +++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 61b5bcd..fb89235 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
 	tree_mod_log_write_lock(fs_info);
 	spin_lock(&fs_info->tree_mod_seq_lock);
-	if (!elem->seq) {
+	if (elem && !elem->seq) {
 		elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
 		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
 	}
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9e1a1c9..3ec3d08 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4e6ef49..1cb58f9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 {
 	struct qgroup_update *u;
 
+	if (!trans->root->fs_info->quota_enabled)
+		return 0;
+
 	BUG_ON(!trans->delayed_ref_elem.seq);
 	u = kmalloc(sizeof(*u), GFP_NOFS);
 	if (!u)

-- 
1.7.7
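The change boils down to: only hand a tree-mod-seq element to btrfs_get_tree_mod_seq() when quota is enabled, and let the callee tolerate a NULL element. A minimal sketch of that guard pattern, with hypothetical names:

#include <stdio.h>

struct seq_elem { unsigned long long seq; };

static unsigned long long next_seq = 1;

/* Hand out the next sequence number; only remember the caller's element
 * when one was actually supplied (i.e. quota accounting is enabled). */
static unsigned long long get_mod_seq(struct seq_elem *elem)
{
	if (elem && !elem->seq)
		elem->seq = next_seq;     /* registered for qgroup accounting */
	return next_seq++;
}

int main(void)
{
	struct seq_elem elem = { 0 };
	int quota_enabled = 0;

	unsigned long long seq = get_mod_seq(quota_enabled ? &elem : NULL);
	printf("seq %llu, elem registered: %s\n", seq, elem.seq ? "yes" : "no");
	return 0;
}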
This is a preparation step for online/inband dedup tree. It introduces dedup tree and its relatives, including hash driver and some structures. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/ctree.h | 73 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/disk-io.c | 41 +++++++++++++++++++++++ fs/btrfs/extent-tree.c | 2 + include/trace/events/btrfs.h | 3 +- 4 files changed, 118 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0506f40..4cc91c5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -33,6 +33,7 @@ #include <asm/kmap_types.h> #include <linux/pagemap.h> #include <linux/btrfs.h> +#include <crypto/hash.h> #include "extent_io.h" #include "extent_map.h" #include "async-thread.h" @@ -95,6 +96,9 @@ struct btrfs_ordered_sum; /* for storing items that use the BTRFS_UUID_KEY* types */ #define BTRFS_UUID_TREE_OBJECTID 9ULL +/* dedup tree(experimental) */ +#define BTRFS_DEDUP_TREE_OBJECTID 10ULL + /* for storing balance parameters in the root tree */ #define BTRFS_BALANCE_OBJECTID -4ULL @@ -515,6 +519,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL @@ -526,6 +531,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) /* @@ -897,6 +903,51 @@ struct btrfs_csum_item { u8 csum; } __attribute__ ((__packed__)); +/* dedup */ +enum btrfs_dedup_type { + BTRFS_DEDUP_SHA256 = 0, + BTRFS_DEDUP_LAST = 1, +}; + +static int btrfs_dedup_lens[] = { 4, 0 }; +static int btrfs_dedup_sizes[] = { 32, 0 }; /* 256bit, 32bytes */ + +struct btrfs_dedup_item { + /* disk length of dedup range */ + __le64 len; + + u8 type; + u8 compression; + u8 encryption; + + /* spare for later use */ + __le16 other_encoding; + + /* hash/fingerprints go here */ +} __attribute__ ((__packed__)); + +struct btrfs_dedup_hash { + u64 bytenr; + u64 num_bytes; + + /* hash algorithm */ + int type; + + int compression; + + /* last field is a variable length array of dedup hash */ + u64 hash[]; +}; + +static inline int btrfs_dedup_hash_size(int type) +{ + WARN_ON((btrfs_dedup_lens[type] * sizeof(u64)) !+ btrfs_dedup_sizes[type]); + + return sizeof(struct btrfs_dedup_hash) + btrfs_dedup_sizes[type]; +} + + struct btrfs_dev_stats_item { /* * grow this item struct at the end for future enhancements and keep @@ -1298,6 +1349,7 @@ struct btrfs_fs_info { struct btrfs_root *dev_root; struct btrfs_root *fs_root; struct btrfs_root *csum_root; + struct btrfs_root *dedup_root; struct btrfs_root *quota_root; struct btrfs_root *uuid_root; @@ -1650,6 +1702,14 @@ struct btrfs_fs_info { struct semaphore uuid_tree_rescan_sem; unsigned int update_uuid_tree_gen:1; + + /* reference to deduplication algorithm driver via cryptoapi */ + struct crypto_shash *dedup_driver; + + /* dedup blocksize */ + u64 dedup_bs; + + int dedup_type; }; /* @@ -1930,6 +1990,8 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_BALANCE_ITEM_KEY 248 +#define BTRFS_DEDUP_ITEM_KEY 251 + /* * Persistantly stores the io stats in the device tree. * One key for all stats, (0, BTRFS_DEV_STATS_KEY, devid). 
@@ -2974,6 +3036,14 @@ static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb, return btrfs_item_size(eb, e) - offset; } +/* btrfs_dedup_item */ +BTRFS_SETGET_FUNCS(dedup_len, struct btrfs_dedup_item, len, 64); +BTRFS_SETGET_FUNCS(dedup_compression, struct btrfs_dedup_item, compression, 8); +BTRFS_SETGET_FUNCS(dedup_encryption, struct btrfs_dedup_item, encryption, 8); +BTRFS_SETGET_FUNCS(dedup_other_encoding, struct btrfs_dedup_item, + other_encoding, 16); +BTRFS_SETGET_FUNCS(dedup_type, struct btrfs_dedup_item, type, 8); + /* btrfs_dev_stats_item */ static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb, struct btrfs_dev_stats_item *ptr, @@ -3443,6 +3513,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_root *root) static inline void free_fs_info(struct btrfs_fs_info *fs_info) { + if (fs_info->dedup_driver) + crypto_free_shash(fs_info->dedup_driver); kfree(fs_info->balance_ctl); kfree(fs_info->delayed_root); kfree(fs_info->extent_root); @@ -3618,6 +3690,7 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans, u64 isize); int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit); + /* inode.c */ struct btrfs_delalloc_work { struct inode *inode; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 289c09b..27b3739 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -154,6 +154,7 @@ static struct btrfs_lockdep_keyset { { .id = BTRFS_FS_TREE_OBJECTID, .name_stem = "fs" }, { .id = BTRFS_CSUM_TREE_OBJECTID, .name_stem = "csum" }, { .id = BTRFS_QUOTA_TREE_OBJECTID, .name_stem = "quota" }, + { .id = BTRFS_DEDUP_TREE_OBJECTID, .name_stem = "dedup" }, { .id = BTRFS_TREE_LOG_OBJECTID, .name_stem = "log" }, { .id = BTRFS_TREE_RELOC_OBJECTID, .name_stem = "treloc" }, { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc" }, @@ -1583,6 +1584,9 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info, if (location->objectid == BTRFS_UUID_TREE_OBJECTID) return fs_info->uuid_root ? fs_info->uuid_root : ERR_PTR(-ENOENT); + if (location->objectid == BTRFS_DEDUP_TREE_OBJECTID) + return fs_info->dedup_root ? 
fs_info->dedup_root : + ERR_PTR(-ENOENT); again: root = btrfs_lookup_fs_root(fs_info, location->objectid); if (root) { @@ -2050,6 +2054,12 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) info->uuid_root->node = NULL; info->uuid_root->commit_root = NULL; } + if (info->dedup_root) { + free_extent_buffer(info->dedup_root->node); + free_extent_buffer(info->dedup_root->commit_root); + info->dedup_root->node = NULL; + info->dedup_root->commit_root = NULL; + } if (chunk_root) { free_extent_buffer(info->chunk_root->node); free_extent_buffer(info->chunk_root->commit_root); @@ -2089,6 +2099,19 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info) } } +static struct crypto_shash * +btrfs_build_dedup_driver(struct btrfs_fs_info *info) +{ + switch (info->dedup_type) { + case BTRFS_DEDUP_SHA256: + return crypto_alloc_shash("sha256", 0, 0); + default: + pr_err("btrfs: unrecognized dedup type\n"); + break; + } + return ERR_PTR(-EINVAL); +} + int open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_devices, char *options) @@ -2111,6 +2134,7 @@ int open_ctree(struct super_block *sb, struct btrfs_root *dev_root; struct btrfs_root *quota_root; struct btrfs_root *uuid_root; + struct btrfs_root *dedup_root; struct btrfs_root *log_tree_root; int ret; int err = -EINVAL; @@ -2200,6 +2224,8 @@ int open_ctree(struct super_block *sb, atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = 8192 * 1024; + fs_info->dedup_bs = 0; + fs_info->dedup_type = BTRFS_DEDUP_SHA256; fs_info->metadata_ratio = 0; fs_info->defrag_inodes = RB_ROOT; fs_info->free_chunk_space = 0; @@ -2476,6 +2502,14 @@ int open_ctree(struct super_block *sb, goto fail_alloc; } + fs_info->dedup_driver = btrfs_build_dedup_driver(fs_info); + if (IS_ERR(fs_info->dedup_driver)) { + pr_info("BTRFS: Cannot load sha256 driver\n"); + err = PTR_ERR(fs_info->dedup_driver); + fs_info->dedup_driver = NULL; + goto fail_alloc; + } + btrfs_init_workers(&fs_info->generic_worker, "genwork", 1, NULL); @@ -2726,6 +2760,13 @@ retry_root_backup: generation != btrfs_super_uuid_tree_generation(disk_super); } + location.objectid = BTRFS_DEDUP_TREE_OBJECTID; + dedup_root = btrfs_read_tree_root(tree_root, &location); + if (!IS_ERR(dedup_root)) { + dedup_root->track_dirty = 1; + fs_info->dedup_root = dedup_root; + } + fs_info->generation = generation; fs_info->last_trans_committed = generation; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 6d92c54..ee97f3a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4706,6 +4706,8 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info) if (fs_info->quota_root) fs_info->quota_root->block_rsv = &fs_info->global_block_rsv; fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv; + if (fs_info->dedup_root) + fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv; update_global_block_rsv(fs_info); } diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index f18b3b7..7914ee9 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -40,6 +40,7 @@ struct extent_buffer; { BTRFS_ROOT_TREE_DIR_OBJECTID, "ROOT_TREE_DIR" }, \ { BTRFS_CSUM_TREE_OBJECTID, "CSUM_TREE" }, \ { BTRFS_TREE_LOG_OBJECTID, "TREE_LOG" }, \ + { BTRFS_DEDUP_TREE_OBJECTID, "DEDUP_TREE" }, \ { BTRFS_QUOTA_TREE_OBJECTID, "QUOTA_TREE" }, \ { BTRFS_TREE_RELOC_OBJECTID, "TREE_RELOC" }, \ { BTRFS_UUID_TREE_OBJECTID, "UUID_RELOC" }, \ @@ -48,7 +49,7 @@ struct extent_buffer; #define show_root_type(obj) \ obj, ((obj >= 
BTRFS_DATA_RELOC_TREE_OBJECTID) || \ (obj >= BTRFS_ROOT_TREE_OBJECTID && \ - obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-" + obj <= BTRFS_DEDUP_TREE_OBJECTID)) ? __show_root_type(obj) : "-" #define BTRFS_GROUP_FLAGS \ { BTRFS_BLOCK_GROUP_DATA, "DATA"}, \ -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
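One detail worth calling out: struct btrfs_dedup_hash ends in a flexible array member, so it is always allocated with btrfs_dedup_hash_size(type), i.e. the fixed header plus the digest size (32 bytes for SHA-256). A small userspace illustration of that allocation pattern; the struct below only mirrors the shape, it is not the kernel definition.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* mirrors the shape of btrfs_dedup_hash: fixed header + trailing digest */
struct dedup_hash {
	uint64_t bytenr;
	uint64_t num_bytes;
	int type;
	int compression;
	uint64_t hash[];        /* flexible array member: digest lives here */
};

#define SHA256_BYTES 32

static size_t dedup_hash_size(void)
{
	/* 4 x u64 == 32 bytes, matching btrfs_dedup_sizes[BTRFS_DEDUP_SHA256] */
	return sizeof(struct dedup_hash) + SHA256_BYTES;
}

int main(void)
{
	struct dedup_hash *h = calloc(1, dedup_hash_size());

	if (!h)
		return 1;
	h->hash[3] = 0xdeadbeefULL;   /* the last u64 later becomes key.objectid */
	printf("allocation size: %zu bytes\n", dedup_hash_size());
	free(h);
	return 0;
}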
The operations consist of finding matched items, adding new items and removing items. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/ctree.h | 9 ++ fs/btrfs/file-item.c | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 219 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4cc91c5..b3a3489 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3691,6 +3691,15 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans, int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit); +int noinline_for_stack +btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash); +int noinline_for_stack +btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_dedup_hash *hash); +int noinline_for_stack +btrfs_free_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 hash, u64 bytenr); /* inode.c */ struct btrfs_delalloc_work { struct inode *inode; diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 4f53159..ddb489e 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -883,3 +883,213 @@ out: fail_unlock: goto out; } + +/* 1 means we find one, 0 means we dont. */ +int noinline_for_stack +btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + struct btrfs_dedup_item *item; + u64 hash_value; + u64 length; + u64 dedup_size; + int compression; + int found = 0; + int index; + int ret; + + if (!hash) { + WARN_ON(1); + return 0; + } + if (!root->fs_info->dedup_root) { + WARN(1, KERN_INFO "dedup not enabled\n"); + return 0; + } + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return 0; + + /* + * For SHA256 dedup algorithm, we store the last 64bit as the + * key.objectid, and the rest in the tree item. + */ + index = btrfs_dedup_lens[hash->type] - 1; + dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64); + + hash_value = hash->hash[index]; + + key.objectid = hash_value; + key.offset = (u64)-1; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + ret = btrfs_search_slot(NULL, dedup_root, &key, path, 0, 0); + if (ret < 0) + goto out; + if (ret == 0) { + WARN_ON(1); + goto out; + } + +prev_slot: + /* this will do match checks. 
*/ + ret = btrfs_previous_item(dedup_root, path, hash_value, + BTRFS_DEDUP_ITEM_KEY); + if (ret) + goto out; + + leaf = path->nodes[0]; + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (key.objectid != hash_value) + goto out; + + item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dedup_item); + /* disk length of dedup range */ + length = btrfs_dedup_len(leaf, item); + + compression = btrfs_dedup_compression(leaf, item); + if (compression > BTRFS_COMPRESS_TYPES) { + WARN_ON(1); + goto out; + } + + if (btrfs_dedup_type(leaf, item) != hash->type) + goto prev_slot; + + if (memcmp_extent_buffer(leaf, hash->hash, (unsigned long)(item + 1), + dedup_size)) { + pr_info("goto prev\n"); + goto prev_slot; + } + + hash->bytenr = key.offset; + hash->num_bytes = length; + hash->compression = compression; + found = 1; +out: + btrfs_free_path(path); + return found; +} + +int noinline_for_stack +btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_dedup_hash *hash) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + struct btrfs_dedup_item *dedup_item; + u64 ins_size; + u64 dedup_size; + int index; + int ret; + + if (!hash) { + WARN_ON(1); + return 0; + } + + WARN_ON(hash->num_bytes > root->fs_info->dedup_bs); + + if (!root->fs_info->dedup_root) { + WARN(1, KERN_INFO "dedup not enabled\n"); + return 0; + } + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + /* + * For SHA256 dedup algorithm, we store the last 64bit as the + * key.objectid, and the rest in the tree item. + */ + index = btrfs_dedup_lens[hash->type] - 1; + dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64); + + ins_size = sizeof(*dedup_item) + dedup_size; + + key.objectid = hash->hash[index]; + key.offset = hash->bytenr; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + path->leave_spinning = 1; + ret = btrfs_insert_empty_item(trans, dedup_root, path, &key, ins_size); + if (ret < 0) + goto out; + leaf = path->nodes[0]; + + dedup_item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_dedup_item); + /* disk length of dedup range */ + btrfs_set_dedup_len(leaf, dedup_item, hash->num_bytes); + btrfs_set_dedup_compression(leaf, dedup_item, hash->compression); + btrfs_set_dedup_encryption(leaf, dedup_item, 0); + btrfs_set_dedup_other_encoding(leaf, dedup_item, 0); + btrfs_set_dedup_type(leaf, dedup_item, hash->type); + + write_extent_buffer(leaf, hash->hash, (unsigned long)(dedup_item + 1), + dedup_size); + + btrfs_mark_buffer_dirty(leaf); +out: + WARN_ON(ret == -EEXIST); + btrfs_free_path(path); + return ret; +} + +int noinline_for_stack +btrfs_free_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 hash, u64 bytenr) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + int ret = 0; + + if (!root->fs_info->dedup_root) + return 0; + + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return ret; + + key.objectid = hash; + key.offset = bytenr; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + ret = btrfs_search_slot(trans, dedup_root, &key, path, -1, 1); + if (ret < 0) + goto out; + if (ret) { + WARN_ON(1); + ret = -ENOENT; + goto out; + } + + leaf = path->nodes[0]; + + ret = -ENOENT; + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (btrfs_key_type(&key) != BTRFS_DEDUP_ITEM_KEY) + goto out; + if (key.objectid != 
hash || key.offset != bytenr) + goto out; + + ret = btrfs_del_item(trans, dedup_root, path); + WARN_ON(ret); +out: + btrfs_free_path(path); + return ret; +} -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
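Because only the last 64 bits of the digest live in key.objectid, several items can share an objectid; btrfs_find_dedup_extent() therefore seeks to (hash_value, (u64)-1), then steps backwards with btrfs_previous_item() and compares the remaining 24 hash bytes stored in each candidate item. A toy model of that "seek past, step back, verify" lookup over a sorted array (all names hypothetical):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct item {
	uint64_t objectid;      /* last u64 of the SHA-256 digest */
	uint64_t bytenr;        /* key.offset: where the extent lives */
	uint8_t rest[24];       /* remaining hash bytes kept in the item body */
};

/* items sorted by (objectid, bytenr), like leaves of the dedup tree */
static struct item tree[] = {
	{ 0x1111, 4096,  { 0xaa } },
	{ 0x2222, 8192,  { 0xbb } },
	{ 0x2222, 12288, { 0xcc } },  /* same objectid, different full hash */
};

static const struct item *find(uint64_t objectid, const uint8_t rest[24])
{
	int n = sizeof(tree) / sizeof(tree[0]);
	int i;

	/* "search for (objectid, -1)": land just past all matching items */
	for (i = 0; i < n && tree[i].objectid <= objectid; i++)
		;
	/* walk backwards over items with the same objectid, verifying the
	 * rest of the hash, like btrfs_previous_item() in the patch */
	while (--i >= 0 && tree[i].objectid == objectid) {
		if (!memcmp(tree[i].rest, rest, 24))
			return &tree[i];
	}
	return NULL;
}

int main(void)
{
	uint8_t want[24] = { 0xcc };
	const struct item *hit = find(0x2222, want);

	if (hit)
		printf("found extent at bytenr %llu\n",
		       (unsigned long long)hit->bytenr);
	else
		printf("no match\n");
	return 0;
}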
This introduces the dedup state and related operations to mark and unmark a
dedup data range; it'll be used in later patches.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/extent_io.c |   14 ++++++++++++++
 fs/btrfs/extent_io.h |    5 +++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 22bda32..f566d0c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1251,6 +1251,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
 				cached_state, mask);
 }
 
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+		     struct extent_state **cached_state, gfp_t mask)
+{
+	return set_extent_bit(tree, start, end, EXTENT_DEDUP, 0,
+			      cached_state, mask);
+}
+
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+		       struct extent_state **cached_state, gfp_t mask)
+{
+	return clear_extent_bit(tree, start, end, EXTENT_DEDUP, 0, 0,
+				cached_state, mask);
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6dbc645..979aadb 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
 #define EXTENT_NEED_WAIT (1 << 13)
 #define EXTENT_DAMAGED (1 << 14)
 #define EXTENT_NORESERVE (1 << 15)
+#define EXTENT_DEDUP (1 << 16)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -227,6 +228,10 @@ int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
 			struct extent_state **cached_state, gfp_t mask);
 int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
 			  struct extent_state **cached_state, gfp_t mask);
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+		     struct extent_state **cached_state, gfp_t mask);
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+		       struct extent_state **cached_state, gfp_t mask);
 int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
 		   gfp_t mask);
 int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,

-- 
1.7.7
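EXTENT_DEDUP is just another bit in the extent_io state mask, and set_extent_dedup()/clear_extent_dedup() are thin wrappers around set_extent_bit()/clear_extent_bit(). The flag arithmetic itself is nothing more than:

#include <stdio.h>

/* values mirror fs/btrfs/extent_io.h; EXTENT_DEDUP simply claims bit 16 */
#define EXTENT_DAMAGED   (1 << 14)
#define EXTENT_NORESERVE (1 << 15)
#define EXTENT_DEDUP     (1 << 16)

int main(void)
{
	unsigned int state = 0;

	state |= EXTENT_DEDUP;                  /* set_extent_dedup()   */
	printf("dedup set: %d\n", !!(state & EXTENT_DEDUP));
	state &= ~EXTENT_DEDUP;                 /* clear_extent_dedup() */
	printf("dedup set: %d\n", !!(state & EXTENT_DEDUP));
	return 0;
}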
This adds a dedup flag and dedup hash into ordered extent so that we can insert dedup extents to dedup tree at endio time. The benefit is simplicity, we don''t need to fall back to cleanup dedup structures if the write is cancelled for some reasons. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/ordered-data.c | 38 ++++++++++++++++++++++++++++++++------ fs/btrfs/ordered-data.h | 13 ++++++++++++- 2 files changed, 44 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index c702cb6..48fecc9 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -183,7 +183,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, */ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int dio, int compress_type) + int type, int dio, int compress_type, + int dedup, struct btrfs_dedup_hash *hash) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_ordered_inode_tree *tree; @@ -199,10 +200,23 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, entry->start = start; entry->len = len; if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) && - !(type == BTRFS_ORDERED_NOCOW)) + !(type == BTRFS_ORDERED_NOCOW) && !dedup) entry->csum_bytes_left = disk_len; entry->disk_len = disk_len; entry->bytes_left = len; + entry->dedup = dedup; + entry->hash = NULL; + + if (!dedup && hash) { + entry->hash = kzalloc(btrfs_dedup_hash_size(hash->type), + GFP_NOFS); + if (!entry->hash) { + kmem_cache_free(btrfs_ordered_extent_cache, entry); + return -ENOMEM; + } + memcpy(entry->hash, hash, btrfs_dedup_hash_size(hash->type)); + } + entry->inode = igrab(inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; @@ -251,7 +265,17 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, 0, NULL); +} + +int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + int dedup, struct btrfs_dedup_hash *hash, + int compress_type) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 0, + compress_type, dedup, hash); } int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, @@ -259,16 +283,17 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 1, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, 0, NULL); } int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int compress_type) + int type, int compress_type, + struct btrfs_dedup_hash *hash) { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - compress_type); + compress_type, 0, hash); } /* @@ -501,6 +526,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry) list_del(&sum->list); kfree(sum); } + kfree(entry->hash); kmem_cache_free(btrfs_ordered_extent_cache, entry); } } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 0c0b356..e5eac12 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -109,6 +109,9 @@ struct btrfs_ordered_extent { /* compression algorithm */ int compress_type; + /* whether this ordered extent is marked for dedup or not */ + int dedup; + /* reference 
count */ atomic_t refs; @@ -135,6 +138,9 @@ struct btrfs_ordered_extent { struct completion completion; struct btrfs_work flush_work; struct list_head work_list; + + /* dedup hash of sha256 type */ + struct btrfs_dedup_hash *hash; }; /* @@ -168,11 +174,16 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, int uptodate); int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type); +int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + int dedup, struct btrfs_dedup_hash *hash, + int compress_type); int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type); int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int compress_type); + int type, int compress_type, + struct btrfs_dedup_hash *hash); void btrfs_add_ordered_sum(struct inode *inode, struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
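The design choice here is that the ordered extent owns a private copy of the dedup hash: the copy is made when the ordered extent is created and freed together with it, so a cancelled or failed write needs no separate dedup cleanup, while a successful one can insert the hash into the dedup tree at endio time. A userspace sketch of that ownership pattern (names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dedup_hash { int type; unsigned char digest[32]; };

struct ordered_extent {
	unsigned long long start, len;
	struct dedup_hash *hash;        /* private copy, may be NULL */
};

/* Creating the ordered extent copies the caller's hash, so later failure
 * paths only ever have to free the ordered extent itself. */
static struct ordered_extent *add_ordered_extent(unsigned long long start,
						 unsigned long long len,
						 const struct dedup_hash *hash)
{
	struct ordered_extent *oe = calloc(1, sizeof(*oe));

	if (!oe)
		return NULL;
	oe->start = start;
	oe->len = len;
	if (hash) {
		oe->hash = malloc(sizeof(*hash));
		if (!oe->hash) {
			free(oe);
			return NULL;
		}
		memcpy(oe->hash, hash, sizeof(*hash));
	}
	return oe;
}

static void finish_ordered_extent(struct ordered_extent *oe, int write_ok)
{
	if (write_ok && oe->hash)
		printf("endio: insert hash for extent at %llu into dedup tree\n",
		       oe->start);
	/* on failure there is nothing extra to undo: just drop the copy */
	free(oe->hash);
	free(oe);
}

int main(void)
{
	struct dedup_hash h = { .type = 0 };
	struct ordered_extent *oe = add_ordered_extent(0, 8192, &h);

	if (oe)
		finish_ordered_extent(oe, 1);
	return 0;
}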
The main part of data dedup. This introduces a FORMAT CHANGE. Btrfs provides online(inband/synchronous) and block-level dedup. It maps naturally to btrfs''s block back-reference, which enables us to store multiple copies of data as single copy with references on that copy. The workflow is (1) write some data, (2) get the hash of these data based on btrfs''s dedup blocksize. (3) find matched extents by hash and decide whether to mark it as a duplicate one or not. If no, write the data onto disk, otherwise, add a reference to the matched extent. Btrfs''s built-in dedup supports normal writes and compressed writes. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/extent-tree.c | 150 ++++++++++-- fs/btrfs/extent_io.c | 8 +- fs/btrfs/extent_io.h | 10 + fs/btrfs/inode.c | 640 ++++++++++++++++++++++++++++++++++++++++++------ 4 files changed, 711 insertions(+), 97 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ee97f3a..812f512 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1122,8 +1122,16 @@ static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans, key.offset = parent; } else { key.type = BTRFS_EXTENT_DATA_REF_KEY; - key.offset = hash_extent_data_ref(root_objectid, - owner, offset); + + /* + * we''ve not got the right offset and owner, so search by -1 + * here. + */ + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) + key.offset = (u64)-1; + else + key.offset = hash_extent_data_ref(root_objectid, + owner, offset); } again: recow = 0; @@ -1150,6 +1158,10 @@ again: goto fail; } + if (ret > 0 && root_objectid == BTRFS_DEDUP_TREE_OBJECTID && + path->slots[0] > 0) + path->slots[0]--; + leaf = path->nodes[0]; nritems = btrfs_header_nritems(leaf); while (1) { @@ -1173,14 +1185,22 @@ again: ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_data_ref); - if (match_extent_data_ref(leaf, ref, root_objectid, - owner, offset)) { - if (recow) { - btrfs_release_path(path); - goto again; + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + if (btrfs_extent_data_ref_root(leaf, ref) =+ root_objectid) { + err = 0; + break; + } + } else { + if (match_extent_data_ref(leaf, ref, root_objectid, + owner, offset)) { + if (recow) { + btrfs_release_path(path); + goto again; + } + err = 0; + break; } - err = 0; - break; } path->slots[0]++; } @@ -1324,6 +1344,32 @@ static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans, return ret; } +static noinline u64 extent_data_ref_offset(struct btrfs_root *root, + struct btrfs_path *path, + struct btrfs_extent_inline_ref *iref) +{ + struct btrfs_key key; + struct extent_buffer *leaf; + struct btrfs_extent_data_ref *ref1; + u64 offset = 0; + + leaf = path->nodes[0]; + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (iref) { + WARN_ON(btrfs_extent_inline_ref_type(leaf, iref) !+ BTRFS_EXTENT_DATA_REF_KEY); + ref1 = (struct btrfs_extent_data_ref *)(&iref->offset); + offset = btrfs_extent_data_ref_offset(leaf, ref1); + } else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) { + ref1 = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_extent_data_ref); + offset = btrfs_extent_data_ref_offset(leaf, ref1); + } else { + WARN_ON(1); + } + return offset; +} + static noinline u32 extent_data_ref_count(struct btrfs_root *root, struct btrfs_path *path, struct btrfs_extent_inline_ref *iref) @@ -1591,7 +1637,8 @@ again: err = -ENOENT; while (1) { if (ptr >= end) { - WARN_ON(ptr > end); + WARN_ON(ptr > end && + root_objectid != BTRFS_DEDUP_TREE_OBJECTID); break; } iref = (struct 
btrfs_extent_inline_ref *)ptr; @@ -1606,14 +1653,25 @@ again: if (type == BTRFS_EXTENT_DATA_REF_KEY) { struct btrfs_extent_data_ref *dref; dref = (struct btrfs_extent_data_ref *)(&iref->offset); - if (match_extent_data_ref(leaf, dref, root_objectid, - owner, offset)) { - err = 0; - break; + + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + if (btrfs_extent_data_ref_root(leaf, dref) =+ root_objectid) { + err = 0; + break; + } + } else { + if (match_extent_data_ref(leaf, dref, + root_objectid, owner, + offset)) { + err = 0; + break; + } + if (hash_extent_data_ref_item(leaf, dref) < + hash_extent_data_ref(root_objectid, owner, + offset)) + break; } - if (hash_extent_data_ref_item(leaf, dref) < - hash_extent_data_ref(root_objectid, owner, offset)) - break; } else { u64 ref_offset; ref_offset = btrfs_extent_inline_ref_offset(leaf, iref); @@ -5631,11 +5689,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_extent_inline_ref *iref; int ret; int is_data; - int extent_slot = 0; - int found_extent = 0; - int num_to_del = 1; + int extent_slot; + int found_extent; + int num_to_del; u32 item_size; u64 refs; + u64 orig_root_obj; + u64 dedup_hash; bool skinny_metadata = btrfs_fs_incompat(root->fs_info, SKINNY_METADATA); @@ -5643,6 +5703,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, if (!path) return -ENOMEM; +again: + extent_slot = 0; + found_extent = 0; + num_to_del = 1; + orig_root_obj = root_objectid; + dedup_hash = 0; + path->reada = 1; path->leave_spinning = 1; @@ -5684,6 +5751,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, #endif if (!found_extent) { BUG_ON(iref); + + if (is_data && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + dedup_hash = extent_data_ref_offset(root, path, + NULL); + } ret = remove_extent_backref(trans, extent_root, path, NULL, refs_to_drop, is_data); @@ -5741,6 +5814,10 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, } extent_slot = path->slots[0]; } + } else if (ret == -ENOENT && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + ret = 0; + goto out; } else if (ret == -ENOENT) { btrfs_print_leaf(extent_root, path->nodes[0]); WARN_ON(1); @@ -5834,7 +5911,28 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, } add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid, root_objectid); + + /* + * special case for dedup + * + * We assume the last ref(ref==1) is backref pointing to dedup. + * + * root_obj == 1 means that it''s a free space cache inode, + * and it always uses PREALLOC, so it never has dedup extent. + */ + if (is_data && refs == 1 && + orig_root_obj != BTRFS_ROOT_TREE_OBJECTID) { + btrfs_release_path(path); + root_objectid = BTRFS_DEDUP_TREE_OBJECTID; + parent = 0; + + goto again; + } } else { + if (!dedup_hash && is_data && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) + dedup_hash = extent_data_ref_offset(root, path, iref); + if (found_extent) { BUG_ON(is_data && refs_to_drop ! 
extent_data_ref_count(root, path, iref)); @@ -5861,6 +5959,18 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, extent_root, ret); goto out; } + + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + ret = btrfs_free_dedup_extent(trans, root, + dedup_hash, + bytenr); + if (ret) { + btrfs_abort_transaction(trans, + extent_root, + ret); + goto out; + } + } } ret = update_block_group(root, bytenr, num_bytes, 0); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f566d0c..13a7908 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2356,7 +2356,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end) * Scheduling is not allowed, so the extent state tree is expected * to have one and only one object corresponding to this IO. */ -static void end_bio_extent_writepage(struct bio *bio, int err) +void end_bio_extent_writepage(struct bio *bio, int err) { struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; struct extent_io_tree *tree; @@ -2607,8 +2607,8 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) } -static int __must_check submit_one_bio(int rw, struct bio *bio, - int mirror_num, unsigned long bio_flags) +int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num, + unsigned long bio_flags) { int ret = 0; struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; @@ -2647,7 +2647,7 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page, } -static int submit_extent_page(int rw, struct extent_io_tree *tree, +int submit_extent_page(int rw, struct extent_io_tree *tree, struct page *page, sector_t sector, size_t size, unsigned long offset, struct block_device *bdev, diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 979aadb..f2b87f2 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -350,4 +350,14 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, int end_extent_writepage(struct page *page, int err, u64 start, u64 end); int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, int mirror_num); +int submit_extent_page(int rw, struct extent_io_tree *tree, struct page *page, + sector_t sector, size_t size, unsigned long offset, + struct block_device *bdev, struct bio **bio_ret, + unsigned long max_pages, bio_end_io_t end_io_func, + int mirror_num, unsigned long prev_bio_flags, + unsigned long bio_flags); +void end_bio_extent_writepage(struct bio *bio, int err); +int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num, + unsigned long bio_flags); + #endif diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 22ebc13..00fd6e9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -43,6 +43,7 @@ #include <linux/btrfs.h> #include <linux/blkdev.h> #include <linux/posix_acl_xattr.h> +#include <asm/unaligned.h> #include "compat.h" #include "ctree.h" #include "disk-io.h" @@ -105,6 +106,17 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start, u64 block_start, u64 block_len, u64 orig_block_len, u64 ram_bytes, int type); +static noinline int cow_file_range_dedup(struct inode *inode, + struct page *locked_page, + u64 start, u64 end, int *page_started, + unsigned long *nr_written, int unlock, + struct btrfs_dedup_hash *hash); +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode, + struct page *locked_page, u64 start, u64 end, + get_extent_t *get_extent, int mode, + struct btrfs_dedup_hash *hash); +static int 
btrfs_inode_test_compress(struct inode *inode); + static int btrfs_dirty_inode(struct inode *inode); @@ -297,6 +309,7 @@ struct async_extent { unsigned long nr_pages; int compress_type; struct list_head list; + struct btrfs_dedup_hash *hash; /* dedup hash of sha256 */ }; struct async_cow { @@ -314,22 +327,41 @@ static noinline int add_async_extent(struct async_cow *cow, u64 compressed_size, struct page **pages, unsigned long nr_pages, - int compress_type) + int compress_type, + struct btrfs_dedup_hash *h) { struct async_extent *async_extent; async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS); - BUG_ON(!async_extent); /* -ENOMEM */ + if (!async_extent) + return -ENOMEM; async_extent->start = start; async_extent->ram_size = ram_size; async_extent->compressed_size = compressed_size; async_extent->pages = pages; async_extent->nr_pages = nr_pages; async_extent->compress_type = compress_type; + async_extent->hash = NULL; + if (h) { + async_extent->hash = kmalloc(btrfs_dedup_hash_size(h->type), + GFP_NOFS); + if (!async_extent->hash) { + kfree(async_extent); + return -ENOMEM; + } + memcpy(async_extent->hash, h, btrfs_dedup_hash_size(h->type)); + } + list_add_tail(&async_extent->list, &cow->extents); return 0; } +static noinline void free_async_extent(struct async_extent *p) +{ + kfree(p->hash); + kfree(p); +} + /* * we create compressed extents in two phases. The first * phase compresses a range of pages that have already been @@ -351,7 +383,8 @@ static noinline int compress_file_range(struct inode *inode, struct page *locked_page, u64 start, u64 end, struct async_cow *async_cow, - int *num_added) + int *num_added, + struct btrfs_dedup_hash *dedup_hash) { struct btrfs_root *root = BTRFS_I(inode)->root; u64 num_bytes; @@ -419,9 +452,7 @@ again: * change at any time if we discover bad compression ratios. */ if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) && - (btrfs_test_opt(root, COMPRESS) || - (BTRFS_I(inode)->force_compress) || - (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) { + btrfs_inode_test_compress(inode)) { WARN_ON(pages); pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS); if (!pages) { @@ -549,9 +580,11 @@ cont: * allocation on disk for these compressed pages, * and will submit them to the elevator. */ - add_async_extent(async_cow, start, num_bytes, - total_compressed, pages, nr_pages_ret, - compress_type); + ret = add_async_extent(async_cow, start, num_bytes, + total_compressed, pages, nr_pages_ret, + compress_type, dedup_hash); + if (ret) + goto free_pages_out; if (start + num_bytes < end) { start += num_bytes; @@ -575,8 +608,11 @@ cleanup_and_bail_uncompressed: } if (redirty) extent_range_redirty_for_io(inode, start, end); - add_async_extent(async_cow, start, end - start + 1, - 0, NULL, 0, BTRFS_COMPRESS_NONE); + ret = add_async_extent(async_cow, start, end - start + 1, + 0, NULL, 0, BTRFS_COMPRESS_NONE, dedup_hash); + if (ret) + goto free_pages_out; + *num_added += 1; } @@ -625,38 +661,15 @@ again: retry: /* did the compression code fall back to uncompressed IO? 
*/ if (!async_extent->pages) { - int page_started = 0; - unsigned long nr_written = 0; - - lock_extent(io_tree, async_extent->start, - async_extent->start + - async_extent->ram_size - 1); - - /* allocate blocks */ - ret = cow_file_range(inode, async_cow->locked_page, - async_extent->start, - async_extent->start + - async_extent->ram_size - 1, - &page_started, &nr_written, 0); - - /* JDM XXX */ + ret = run_locked_range(io_tree, inode, + async_cow->locked_page, + async_extent->start, + async_extent->start + + async_extent->ram_size - 1, + btrfs_get_extent, WB_SYNC_ALL, + async_extent->hash); - /* - * if page_started, cow_file_range inserted an - * inline extent and took care of all the unlocking - * and IO for us. Otherwise, we need to submit - * all those pages down to the drive. - */ - if (!page_started && !ret) - extent_write_locked_range(io_tree, - inode, async_extent->start, - async_extent->start + - async_extent->ram_size - 1, - btrfs_get_extent, - WB_SYNC_ALL); - else if (ret) - unlock_page(async_cow->locked_page); - kfree(async_extent); + free_async_extent(async_extent); cond_resched(); continue; } @@ -739,7 +752,8 @@ retry: async_extent->ram_size, ins.offset, BTRFS_ORDERED_COMPRESSED, - async_extent->compress_type); + async_extent->compress_type, + async_extent->hash); if (ret) goto out_free_reserve; @@ -759,7 +773,7 @@ retry: ins.offset, async_extent->pages, async_extent->nr_pages); alloc_hint = ins.objectid + ins.offset; - kfree(async_extent); + free_async_extent(async_extent); if (ret) goto out; cond_resched(); @@ -777,10 +791,366 @@ out_free: EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING, PAGE_UNLOCK | PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK); - kfree(async_extent); + free_async_extent(async_extent); goto again; } +static void btrfs_dedup_hash_final(struct btrfs_dedup_hash *hash); + +static int btrfs_dedup_hash_digest(struct btrfs_root *root, const char *data, + u64 length, struct btrfs_dedup_hash *hash) +{ + struct crypto_shash *tfm = root->fs_info->dedup_driver; + struct { + struct shash_desc desc; + char ctx[crypto_shash_descsize(tfm)]; + } sdesc; + int ret; + + sdesc.desc.tfm = tfm; + sdesc.desc.flags = 0; + + ret = crypto_shash_digest(&sdesc.desc, data, length, + (char *)(hash->hash)); + if (!ret) + btrfs_dedup_hash_final(hash); + return ret; +} + +static void btrfs_dedup_hash_final(struct btrfs_dedup_hash *hash) +{ + int num, i; + + num = btrfs_dedup_lens[hash->type] - 1; + for (i = 0; i < num; i++) + put_unaligned_le64(hash->hash[i], (char *)(hash->hash + i)); +} + +static int btrfs_calc_dedup_hash(struct btrfs_root *root, struct inode *inode, + u64 start, struct btrfs_dedup_hash *hash) +{ + struct page *p; + char *data; + u64 length = root->fs_info->dedup_bs; + u64 blocksize = root->sectorsize; + int err; + + if (length == blocksize) { + p = find_get_page(inode->i_mapping, + (start >> PAGE_CACHE_SHIFT)); + WARN_ON(!p); /* page should be here */ + data = kmap_atomic(p); + err = btrfs_dedup_hash_digest(root, data, length, hash); + kunmap_atomic(data); + page_cache_release(p); + } else { + char *d; + int i = 0; + + data = kmalloc(length, GFP_NOFS); + if (!data) + return -ENOMEM; + + while (blocksize * i < length) { + p = find_get_page(inode->i_mapping, + (start >> PAGE_CACHE_SHIFT) + i); + WARN_ON(!p); /* page should be here */ + d = kmap_atomic(p); + memcpy((data + blocksize * i), d, blocksize); + kunmap_atomic(d); + page_cache_release(p); + i++; + } + + err = btrfs_dedup_hash_digest(root, data, length, hash); + kfree(data); + } + return err; +} + +static 
noinline int +run_delalloc_dedup(struct inode *inode, struct page *locked_page, u64 start, + u64 end, struct async_cow *async_cow) +{ + struct btrfs_root *root = BTRFS_I(inode)->root; + struct bio *bio = NULL; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + struct extent_map *em; + struct page *page = NULL; + struct block_device *bdev; + struct btrfs_key ins; + u64 blocksize = root->sectorsize; + u64 num_bytes; + u64 cur_alloc_size; + u64 cur_end; + u64 alloc_hint = 0; + u64 iosize; + u64 dedup_bs = root->fs_info->dedup_bs; + int compr; + int found; + int type = 0; + sector_t sector; + int ret = 0; + struct extent_state *cached_state = NULL; + struct btrfs_dedup_hash *hash; + int dedup_type = root->fs_info->dedup_type; + + WARN_ON(btrfs_is_free_space_inode(inode)); + + num_bytes = ALIGN(end - start + 1, blocksize); + num_bytes = max(blocksize, num_bytes); + + hash = kzalloc(btrfs_dedup_hash_size(dedup_type), GFP_NOFS); + if (!hash) { + ret = -ENOMEM; + goto out; + } + + btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0); + + while (num_bytes > 0) { + unsigned long op = 0; + + /* page has been locked by caller */ + page = find_get_page(inode->i_mapping, + start >> PAGE_CACHE_SHIFT); + WARN_ON(!page); /* page should be here */ + + /* already ordered? */ + if (PagePrivate2(page)) + goto submit; + + /* too small data, go for normal path */ + if (num_bytes < dedup_bs) { + cur_end = start + num_bytes - 1; + + if (btrfs_inode_test_compress(inode)) { + int num_added = 0; + compress_file_range(inode, page, start, cur_end, + async_cow, &num_added, + NULL); + } else { + /* Now locked_page is not dirty. */ + if (page_offset(locked_page) >= start && + page_offset(locked_page) <= cur_end) { + __set_page_dirty_nobuffers(locked_page); + } + + ret = run_locked_range(tree, inode, page, start, + cur_end, + btrfs_get_extent, + WB_SYNC_ALL, NULL); + if (ret) + SetPageError(page); + } + + page_cache_release(page); + page = NULL; + + num_bytes -= num_bytes; + start += num_bytes; + cond_resched(); + continue; + } + + cur_alloc_size = min_t(u64, num_bytes, dedup_bs); + WARN_ON(cur_alloc_size < dedup_bs); /* shouldn''t happen */ + cur_end = start + cur_alloc_size - 1; + + /* see comments in compress_file_range */ + extent_range_clear_dirty_for_io(inode, start, cur_end); + + memset(hash, 0, btrfs_dedup_hash_size(dedup_type)); + hash->type = dedup_type; + + ret = btrfs_calc_dedup_hash(root, inode, start, hash); + + if (ret) { + found = 0; + compr = BTRFS_COMPRESS_NONE; + } else { + found = btrfs_find_dedup_extent(root, hash); + compr = hash->compression; + } + + if (found == 0) { + /* + * compress fastpath. + * so we take the original data as dedup string instead + * of compressed data since compression methods and data + * from them vary a lot. 
+ */ + if (btrfs_inode_test_compress(inode)) { + int num_added = 0; + + extent_range_redirty_for_io(inode, start, + cur_end); + + compress_file_range(inode, page, start, cur_end, + async_cow, &num_added, + hash); + + page_cache_release(page); + page = NULL; + + num_bytes -= cur_alloc_size; + start += cur_alloc_size; + cond_resched(); + continue; + } + + /* no compress */ + ret = btrfs_reserve_extent(root, cur_alloc_size, + cur_alloc_size, 0, alloc_hint, + &ins, 1); + if (ret < 0) + goto out_unlock; + } else { /* found same hash */ + ins.objectid = hash->bytenr; + ins.offset = hash->num_bytes; + + set_extent_dedup(tree, start, cur_end, &cached_state, + GFP_NOFS); + } + + lock_extent(tree, start, cur_end); + + em = alloc_extent_map(); + if (!em) { + ret = -ENOMEM; + goto out_reserve; + } + em->start = start; + em->orig_start = em->start; + em->len = cur_alloc_size; + em->mod_start = em->start; + em->mod_len = em->len; + + em->block_start = ins.objectid; + em->block_len = ins.offset; + em->orig_block_len = ins.offset; + em->bdev = root->fs_info->fs_devices->latest_bdev; + set_bit(EXTENT_FLAG_PINNED, &em->flags); + em->generation = -1; + if (compr > BTRFS_COMPRESS_NONE) { + em->compress_type = compr; + set_bit(EXTENT_FLAG_COMPRESSED, &em->flags); + type = BTRFS_ORDERED_COMPRESSED; + } + + while (1) { + write_lock(&em_tree->lock); + ret = add_extent_mapping(em_tree, em, 1); + write_unlock(&em_tree->lock); + if (ret != -EEXIST) { + free_extent_map(em); + break; + } + btrfs_drop_extent_cache(inode, start, cur_end, 0); + } + if (ret) + goto out_reserve; + + ret = btrfs_add_ordered_extent_dedup(inode, start, ins.objectid, + cur_alloc_size, ins.offset, + type, found, hash, compr); + if (ret) + goto out_reserve; + + /* + * Do set the Private2 bit so we know this page was properly + * setup for writepage + */ + op |= PAGE_SET_PRIVATE2 | PAGE_SET_WRITEBACK | PAGE_CLEAR_DIRTY; + extent_clear_unlock_delalloc(inode, start, cur_end, + NULL, + EXTENT_LOCKED | EXTENT_DELALLOC, + op); + +submit: + iosize = blocksize; + + found = test_range_bit(tree, start, start + iosize - 1, + EXTENT_DEDUP, 0, cached_state); + if (found == 0) { + em = btrfs_get_extent(inode, page, 0, start, blocksize, + 1); + if (IS_ERR(em)) { + /* btrfs_get_extent will not return NULL */ + ret = PTR_ERR(em); + goto out_reserve; + } + + sector = (em->block_start + start - em->start) >> 9; + bdev = em->bdev; + free_extent_map(em); + em = NULL; + + /* TODO: rw can be WRTIE_SYNC */ + ret = submit_extent_page(WRITE, tree, page, sector, + iosize, 0, bdev, &bio, + 0, /* max_nr is no used */ + end_bio_extent_writepage, + 0, 0, 0); + if (ret) + break; + } else { + clear_extent_dedup(tree, start, start + iosize - 1, + &cached_state, GFP_NOFS); + + end_extent_writepage(page, 0, start, + start + iosize - 1); + /* we need to do this ourselves because we skip IO */ + end_page_writeback(page); + } + + unlock_page(page); + page_cache_release(page); + page = NULL; + + num_bytes -= blocksize; + alloc_hint = ins.objectid + blocksize; + start += blocksize; + cond_resched(); + } + +out_unlock: + if (bio) { + if (ret) + bio_put(bio); + else + ret = submit_one_bio(WRITE, bio, 0, 0); + bio = NULL; + } + + if (ret && page) + SetPageError(page); + if (page) { + unlock_page(page); + page_cache_release(page); + } + +out: + if (ret && num_bytes > 0) + extent_clear_unlock_delalloc(inode, + start, start + num_bytes - 1, + NULL, + EXTENT_DELALLOC | EXTENT_LOCKED | + EXTENT_DEDUP | EXTENT_DEFRAG, + PAGE_UNLOCK | PAGE_SET_WRITEBACK | + PAGE_END_WRITEBACK | 
PAGE_CLEAR_DIRTY); + + kfree(hash); + free_extent_state(cached_state); + return ret; + +out_reserve: + if (found == 0) + btrfs_free_reserved_extent(root, ins.objectid, ins.offset); + goto out_unlock; +} + static u64 get_extent_allocation_hint(struct inode *inode, u64 start, u64 num_bytes) { @@ -826,11 +1196,11 @@ static u64 get_extent_allocation_hint(struct inode *inode, u64 start, * required to start IO on it. It may be clean and already done with * IO when we return. */ -static noinline int cow_file_range(struct inode *inode, +static noinline int __cow_file_range(struct inode *inode, struct page *locked_page, u64 start, u64 end, int *page_started, unsigned long *nr_written, - int unlock) + int unlock, struct btrfs_dedup_hash *hash) { struct btrfs_root *root = BTRFS_I(inode)->root; u64 alloc_hint = 0; @@ -926,8 +1296,16 @@ static noinline int cow_file_range(struct inode *inode, goto out_reserve; cur_alloc_size = ins.offset; - ret = btrfs_add_ordered_extent(inode, start, ins.objectid, - ram_size, cur_alloc_size, 0); + if (!hash) + ret = btrfs_add_ordered_extent(inode, start, + ins.objectid, ram_size, + cur_alloc_size, 0); + else + ret = btrfs_add_ordered_extent_dedup(inode, start, + ins.objectid, ram_size, + cur_alloc_size, 0, 0, + hash, + BTRFS_COMPRESS_NONE); if (ret) goto out_reserve; @@ -975,21 +1353,76 @@ out_unlock: goto out; } +static noinline int cow_file_range(struct inode *inode, + struct page *locked_page, + u64 start, u64 end, int *page_started, + unsigned long *nr_written, + int unlock) +{ + return __cow_file_range(inode, locked_page, start, end, page_started, + nr_written, unlock, NULL); +} + +static noinline int cow_file_range_dedup(struct inode *inode, + struct page *locked_page, + u64 start, u64 end, int *page_started, + unsigned long *nr_written, + int unlock, struct btrfs_dedup_hash *hash) +{ + return __cow_file_range(inode, locked_page, start, end, page_started, + nr_written, unlock, hash); +} + +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode, + struct page *locked_page, u64 start, u64 end, + get_extent_t *get_extent, int mode, + struct btrfs_dedup_hash *hash) +{ + int page_started = 0; + unsigned long nr_written = 0; + int ret; + + lock_extent(tree, start, end); + + /* allocate blocks */ + ret = cow_file_range_dedup(inode, locked_page, start, end, + &page_started, &nr_written, 0, hash); + + /* + * if page_started, cow_file_range inserted an + * inline extent and took care of all the unlocking + * and IO for us. Otherwise, we need to submit + * all those pages down to the drive. 
+ */ + if (!page_started && !ret) + extent_write_locked_range(tree, inode, start, end, get_extent, + mode); + else if (ret) + unlock_page(locked_page); + + return ret; +} + /* * work queue call back to started compression on a file and pages */ static noinline void async_cow_start(struct btrfs_work *work) { struct async_cow *async_cow; - int num_added = 0; async_cow = container_of(work, struct async_cow, work); - compress_file_range(async_cow->inode, async_cow->locked_page, - async_cow->start, async_cow->end, async_cow, - &num_added); - if (num_added == 0) { - btrfs_add_delayed_iput(async_cow->inode); - async_cow->inode = NULL; + if (async_cow->root->fs_info->dedup_bs != 0) { + run_delalloc_dedup(async_cow->inode, async_cow->locked_page, + async_cow->start, async_cow->end, async_cow); + } else { + int num_added = 0; + compress_file_range(async_cow->inode, async_cow->locked_page, + async_cow->start, async_cow->end, async_cow, + &num_added, NULL); + if (num_added == 0) { + btrfs_add_delayed_iput(async_cow->inode); + async_cow->inode = NULL; + } } } @@ -1387,6 +1820,19 @@ error: return ret; } +static int btrfs_inode_test_compress(struct inode *inode) +{ + struct btrfs_inode *bi = BTRFS_I(inode); + struct btrfs_root *root = BTRFS_I(inode)->root; + + if (btrfs_test_opt(root, COMPRESS) || + bi->force_compress || + bi->flags & BTRFS_INODE_COMPRESS) + return 1; + + return 0; +} + /* * extent_io.c call back to do delayed allocation processing */ @@ -1396,21 +1842,21 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, { int ret; struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_inode *bi = BTRFS_I(inode); - if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) { + if (bi->flags & BTRFS_INODE_NODATACOW) { ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); - } else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC) { + } else if (bi->flags & BTRFS_INODE_PREALLOC) { ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 0, nr_written); - } else if (!btrfs_test_opt(root, COMPRESS) && - !(BTRFS_I(inode)->force_compress) && - !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS)) { + } else if (!btrfs_inode_test_compress(inode) && + root->fs_info->dedup_bs == 0) { ret = cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); } else { set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, - &BTRFS_I(inode)->runtime_flags); + &bi->runtime_flags); ret = cow_file_range_async(inode, locked_page, start, end, page_started, nr_written); } @@ -1831,12 +2277,14 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end) return -EBUSY; } -static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, - struct inode *inode, u64 file_pos, - u64 disk_bytenr, u64 disk_num_bytes, - u64 num_bytes, u64 ram_bytes, - u8 compression, u8 encryption, - u16 other_encoding, int extent_type) +static int __insert_reserved_file_extent(struct btrfs_trans_handle *trans, + struct inode *inode, u64 file_pos, + u64 disk_bytenr, u64 disk_num_bytes, + u64 num_bytes, u64 ram_bytes, + u8 compression, u8 encryption, + u16 other_encoding, int extent_type, + int dedup, + struct btrfs_dedup_hash *hash) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_file_extent_item *fi; @@ -1893,15 +2341,59 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, ins.objectid = disk_bytenr; ins.offset = disk_num_bytes; ins.type = BTRFS_EXTENT_ITEM_KEY; - ret = btrfs_alloc_reserved_file_extent(trans, root, - 
root->root_key.objectid, - btrfs_ino(inode), file_pos, &ins); + + if (!dedup) { + ret = btrfs_alloc_reserved_file_extent(trans, root, + root->root_key.objectid, + btrfs_ino(inode), + file_pos, &ins); + if (ret) + goto out; + + if (hash) { + int index = btrfs_dedup_lens[hash->type] - 1; + + hash->bytenr = ins.objectid; + hash->num_bytes = ins.offset; + hash->compression = compression; + ret = btrfs_insert_dedup_extent(trans, root, hash); + if (ret) + goto out; + + ret = btrfs_inc_extent_ref(trans, root, ins.objectid, + ins.offset, 0, + BTRFS_DEDUP_TREE_OBJECTID, + btrfs_ino(inode), + hash->hash[index], 0); + } + } else { + ret = btrfs_inc_extent_ref(trans, root, ins.objectid, + ins.offset, 0, + root->root_key.objectid, + btrfs_ino(inode), + file_pos, /* file_pos - 0 */ + 0); + } out: btrfs_free_path(path); return ret; } +static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, + struct inode *inode, u64 file_pos, + u64 disk_bytenr, u64 disk_num_bytes, + u64 num_bytes, u64 ram_bytes, + u8 compression, u8 encryption, + u16 other_encoding, int extent_type) +{ + return __insert_reserved_file_extent(trans, inode, file_pos, + disk_bytenr, disk_num_bytes, + num_bytes, ram_bytes, compression, + encryption, other_encoding, + extent_type, 0, NULL); +} + /* snapshot-aware defrag */ struct sa_defrag_extent_backref { struct rb_node node; @@ -2640,13 +3132,15 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) logical_len); } else { BUG_ON(root == root->fs_info->tree_root); - ret = insert_reserved_file_extent(trans, inode, + ret = __insert_reserved_file_extent(trans, inode, ordered_extent->file_offset, ordered_extent->start, ordered_extent->disk_len, logical_len, logical_len, compress_type, 0, 0, - BTRFS_FILE_EXTENT_REG); + BTRFS_FILE_EXTENT_REG, + ordered_extent->dedup, + ordered_extent->hash); } unpin_extent_cache(&BTRFS_I(inode)->extent_tree, ordered_extent->file_offset, ordered_extent->len, -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
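To make the write path in the patch above easier to follow without wading through the whole diff, here is a small, self-contained user-space sketch of the same per-block decision that run_delalloc_dedup() makes: carve the range into dedup_bs-sized blocks, hash each block, and either reuse an already-known block or "allocate" a new one. It is purely illustrative: a trivial FNV-1a hash stands in for the SHA-256 digest the kernel computes through the crypto API, and an in-memory table stands in for the dedup tree.

/* dedup_demo.c - illustrative only, not part of this patch set.
 * Mimics the per-block decision of run_delalloc_dedup(): hash every
 * dedup_bs-sized block and reuse a previously seen block instead of
 * "allocating" a new one.  FNV-1a stands in for SHA-256, and a tiny
 * open-addressing table stands in for the dedup tree.
 */
#include <stdio.h>
#include <stdint.h>

#define DEDUP_BS   8192                 /* default blocksize in this series */
#define TABLE_SIZE 4096

static uint64_t table[TABLE_SIZE];      /* stored hashes, 0 == empty slot */

static uint64_t fnv1a(const unsigned char *data, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ULL;
        size_t i;

        for (i = 0; i < len; i++) {
                h ^= data[i];
                h *= 0x100000001b3ULL;
        }
        return h ? h : 1;               /* keep 0 reserved for "empty" */
}

/* Returns 1 if the block is already known (reuse), 0 if it is new. */
static int find_or_insert(uint64_t hash)
{
        size_t idx = hash % TABLE_SIZE;
        size_t probes = 0;

        while (probes++ < TABLE_SIZE) {
                if (table[idx] == hash)
                        return 1;       /* kernel: point at the existing extent */
                if (table[idx] == 0) {
                        table[idx] = hash;
                        return 0;       /* kernel: reserve a new extent */
                }
                idx = (idx + 1) % TABLE_SIZE;
        }
        return 0;                       /* table full, treat as new */
}

int main(void)
{
        unsigned char block[DEDUP_BS];
        unsigned long reused = 0, allocated = 0;

        /* A tail shorter than dedup_bs is skipped, just as the kernel
         * falls back to the normal path for ranges below dedup_bs. */
        while (fread(block, 1, DEDUP_BS, stdin) == DEDUP_BS) {
                if (find_or_insert(fnv1a(block, DEDUP_BS)))
                        reused++;
                else
                        allocated++;
        }
        printf("blocks reused: %lu, newly allocated: %lu\n",
               reused, allocated);
        return 0;
}

Running it as "./dedup_demo < some_file" gives a rough upper bound on what an 8K-block inband dedup could save for that particular data.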
Liu Bo
2013-Oct-14 04:59 UTC
[PATCH v7 10/13] Btrfs: skip dedup reference during backref walking
The dedup ref is quite a special one, it is just used to store the hash value of the extent and cannot be used to find data, so we skip it during backref walking. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/backref.c | 9 +++++++++ fs/btrfs/relocation.c | 3 +++ 2 files changed, 12 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 0552a59..6ff57d2 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -588,6 +588,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq, key.objectid = ref->objectid; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = ref->offset; + if (ref->root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, ref->root, &key, 0, 0, node->bytenr, node->ref_mod * sgn, GFP_ATOMIC); @@ -706,6 +709,9 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info, key.type = BTRFS_EXTENT_DATA_KEY; key.offset = btrfs_extent_data_ref_offset(leaf, dref); root = btrfs_extent_data_ref_root(leaf, dref); + if (root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, root, &key, 0, 0, bytenr, count, GFP_NOFS); break; @@ -789,6 +795,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info, key.type = BTRFS_EXTENT_DATA_KEY; key.offset = btrfs_extent_data_ref_offset(leaf, dref); root = btrfs_extent_data_ref_root(leaf, dref); + if (root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, root, &key, 0, 0, bytenr, count, GFP_NOFS); break; diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index a5a2632..d9bb01f 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -3486,6 +3486,9 @@ static int find_data_references(struct reloc_control *rc, ref_offset = btrfs_extent_data_ref_offset(leaf, ref); ref_count = btrfs_extent_data_ref_count(leaf, ref); + if (ref_root == BTRFS_DEDUP_TREE_OBJECTID) + return 0; + /* * This is an extent belonging to the free space cache, lets just delete * it and redo the search. -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
If the ordered extent had an IOERR or something else went wrong, we need to return the space for this ordered extent back to the allocator, but if the extent is marked as a dedup one, we don't free the space because we just use the existing space instead of allocating new space. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/inode.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 00fd6e9..6ab6246 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3190,6 +3190,7 @@ out: * truncated case if we didn't write out the extent at all. */ if ((ret || !logical_len) && + !ordered_extent->dedup && !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) btrfs_free_reserved_extent(root, ordered_extent->start, -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The dedup reference is a special kind of delayed ref, and the delayed refs are batched to be processed later. If we find a matched dedup extent, then we queue an ADD delayed ref on it during endio work, but there is already a DROP delayed ref queued there, t1 t2 t3 ->writepage commit transaction ->run_delalloc_dedup find_dedup ------------------------------------------------------------------------------ process_delayed refs add ordered extent submit pages finish ordered io insert file extents queue delayed refs queue dedup ref This scenario ends up with a crash because we're going to insert a ref on a deleted extent. To avoid the race, we need to wait for processing delayed refs before finding matched dedup extents. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/file-item.c | 20 ++++++++++++++++++++ 1 files changed, 20 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index ddb489e..8933e13 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -893,6 +893,8 @@ btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash) struct extent_buffer *leaf; struct btrfs_root *dedup_root; struct btrfs_dedup_item *item; + struct btrfs_delayed_ref_root *delayed_refs; + struct btrfs_trans_handle *trans; u64 hash_value; u64 length; u64 dedup_size; @@ -911,6 +913,24 @@ btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash) } dedup_root = root->fs_info->dedup_root; + /* + * This is for serializing the dedup reference add/remove, + * the dedup reference is one of delayed refs, so it's likely + * we find the dedup extent here but there is already a DROP ref + * on it, and this ends up that we insert a ref on a deleted + * extent and get crash. + * Therefore, before finding matched dedup extents, we should + * wait for delayed_ref running's finish. + */ + trans = btrfs_join_transaction(root); + if (!IS_ERR(trans)) { + delayed_refs = &trans->transaction->delayed_refs; + if (delayed_refs && atomic_read(&delayed_refs->procs_running_refs)) + wait_event(delayed_refs->wait, + atomic_read(&delayed_refs->procs_running_refs) == 0); + btrfs_end_transaction(trans, root); + } + path = btrfs_alloc_path(); if (!path) return 0; -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
So far we have 4 commands to control dedup behaviour, - btrfs dedup enable Create the dedup tree, and it''s the very first step when you''re going to use the dedup feature. - btrfs dedup disable Delete the dedup tree, after this we''re not able to use dedup any more unless you enable it again. - btrfs dedup on [-b] Switch on the dedup feature temporarily, and it''s the second step of applying dedup with writes. Option ''-b'' is used to set dedup blocksize. The default blocksize is 8192(no special reason, you may argue), and the current limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the upper limit of btrfs''s compression. - btrfs dedup off Switch off the dedup feature temporarily, but the dedup tree remains. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/ctree.h | 3 + fs/btrfs/disk-io.c | 1 + fs/btrfs/ioctl.c | 167 ++++++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/btrfs.h | 11 +++ 4 files changed, 182 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b3a3489..798b183 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1710,6 +1710,9 @@ struct btrfs_fs_info { u64 dedup_bs; int dedup_type; + + /* protect user change for dedup operations */ + struct mutex dedup_ioctl_mutex; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 27b3739..967397c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2327,6 +2327,7 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount); mutex_init(&fs_info->dev_replace.lock_management_lock); mutex_init(&fs_info->dev_replace.lock); + mutex_init(&fs_info->dedup_ioctl_mutex); spin_lock_init(&fs_info->qgroup_lock); mutex_init(&fs_info->qgroup_ioctl_lock); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9d46f60..e933827 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4492,6 +4492,171 @@ out_unlock: return ret; } +static long btrfs_enable_dedup(struct btrfs_root *root) +{ + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_trans_handle *trans = NULL; + struct btrfs_root *dedup_root; + int ret = 0; + + mutex_lock(&fs_info->dedup_ioctl_mutex); + if (fs_info->dedup_root) { + pr_info("btrfs: dedup has already been enabled\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return 0; + } + + trans = btrfs_start_transaction(root, 2); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return ret; + } + + dedup_root = btrfs_create_tree(trans, fs_info, + BTRFS_DEDUP_TREE_OBJECTID); + if (IS_ERR(dedup_root)) + ret = PTR_ERR(dedup_root); + + if (ret) + btrfs_end_transaction(trans, root); + else + ret = btrfs_commit_transaction(trans, root); + + if (!ret) { + pr_info("btrfs: dedup enabled\n"); + fs_info->dedup_root = dedup_root; + fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv; + btrfs_set_fs_incompat(fs_info, DEDUP); + } + + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return ret; +} + +static long btrfs_disable_dedup(struct btrfs_root *root) +{ + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_root *dedup_root; + int ret; + + mutex_lock(&fs_info->dedup_ioctl_mutex); + if (!fs_info->dedup_root) { + pr_info("btrfs: dedup has been disabled\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return 0; + } + + if (fs_info->dedup_bs != 0) { + pr_info("btrfs: cannot disable dedup until switching off dedup!\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return -EBUSY; + } + + dedup_root = fs_info->dedup_root; + + ret = 
btrfs_drop_snapshot(dedup_root, NULL, 1, 0); + + if (!ret) { + fs_info->dedup_root = NULL; + pr_info("btrfs: dedup disabled\n"); + } + + mutex_unlock(&fs_info->dedup_ioctl_mutex); + WARN_ON(ret < 0 && ret != -EAGAIN && ret != -EROFS); + return ret; +} + +static long btrfs_set_dedup_bs(struct btrfs_root *root, u64 bs) +{ + struct btrfs_fs_info *info = root->fs_info; + int ret = 0; + + mutex_lock(&info->dedup_ioctl_mutex); + if (!info->dedup_root) { + pr_info("btrfs: dedup is disabled, we cannot switch on/off dedup\n"); + ret = -EINVAL; + goto out; + } + + bs = ALIGN(bs, root->sectorsize); + bs = min_t(u64, bs, (128 * 1024ULL)); + + if (bs == info->dedup_bs) { + if (info->dedup_bs == 0) + pr_info("btrfs: switch OFF dedup(it''s already off)\n"); + else + pr_info("btrfs: switch ON dedup(its bs is already %llu)\n", + bs); + goto out; + } + + /* + * The dedup works similar to compression, both use async workqueue to + * reach better performance. We drain the on-going async works here + * so that new dedup writes will apply with the new dedup blocksize. + */ + atomic_inc(&info->async_submit_draining); + while (atomic_read(&info->nr_async_submits) || + atomic_read(&info->async_delalloc_pages)) { + wait_event(info->async_submit_wait, + (atomic_read(&info->nr_async_submits) == 0 && + atomic_read(&info->async_delalloc_pages) == 0)); + } + + /* + * dedup_bs = 0: dedup off; + * dedup_bs > 0: dedup on; + */ + info->dedup_bs = bs; + if (info->dedup_bs == 0) { + pr_info("btrfs: switch OFF dedup\n"); + } else { + info->dedup_bs = bs; + pr_info("btrfs: switch ON dedup(dedup blocksize %llu)\n", + info->dedup_bs); + } + atomic_dec(&info->async_submit_draining); + +out: + mutex_unlock(&info->dedup_ioctl_mutex); + return ret; +} + +static long btrfs_ioctl_dedup_ctl(struct btrfs_root *root, void __user *args) +{ + struct btrfs_ioctl_dedup_args *dargs; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + dargs = memdup_user(args, sizeof(*dargs)); + if (IS_ERR(dargs)) { + ret = PTR_ERR(dargs); + goto out; + } + + switch (dargs->cmd) { + case BTRFS_DEDUP_CTL_ENABLE: + ret = btrfs_enable_dedup(root); + break; + case BTRFS_DEDUP_CTL_DISABLE: + ret = btrfs_disable_dedup(root); + break; + case BTRFS_DEDUP_CTL_SET_BS: + /* dedup on/off */ + ret = btrfs_set_dedup_bs(root, dargs->bs); + break; + default: + ret = -EINVAL; + } + + kfree(dargs); +out: + return ret; +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -4604,6 +4769,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_set_fslabel(file, argp); case BTRFS_IOC_FILE_EXTENT_SAME: return btrfs_ioctl_file_extent_same(file, argp); + case BTRFS_IOC_DEDUP_CTL: + return btrfs_ioctl_dedup_ctl(root, argp); } return -ENOTTY; diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 45e6189..eda8b92 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -399,6 +399,15 @@ struct btrfs_ioctl_get_dev_stats { __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_ENABLE 1 +#define BTRFS_DEDUP_CTL_DISABLE 2 +#define BTRFS_DEDUP_CTL_SET_BS 3 +struct btrfs_ioctl_dedup_args { + __u64 cmd; + __u64 bs; +}; + #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED 3 @@ -606,5 +615,7 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_dev_replace_args) #define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \ struct 
btrfs_ioctl_same_args) +#define BTRFS_IOC_DEDUP_CTL _IOWR(BTRFS_IOCTL_MAGIC, 55, \ + struct btrfs_ioctl_dedup_args) #endif /* _UAPI_LINUX_BTRFS_H */ -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
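The btrfs-progs patch that follows wraps this ioctl for the command line; for completeness, driving the new interface directly from a program looks roughly like the sketch below. It assumes a kernel carrying this series and a <linux/btrfs.h> that already exports BTRFS_IOC_DEDUP_CTL and struct btrfs_ioctl_dedup_args as added above.

/* dedup_ctl_demo.c - minimal sketch, not part of this patch set. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
        struct btrfs_ioctl_dedup_args dargs;
        int fd, ret;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <btrfs mount point>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Step 1: create the dedup tree (needs CAP_SYS_ADMIN). */
        memset(&dargs, 0, sizeof(dargs));
        dargs.cmd = BTRFS_DEDUP_CTL_ENABLE;
        ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, &dargs);
        if (ret < 0)
                perror("BTRFS_DEDUP_CTL_ENABLE");

        /* Step 2: switch dedup on with the default 8K blocksize. */
        memset(&dargs, 0, sizeof(dargs));
        dargs.cmd = BTRFS_DEDUP_CTL_SET_BS;
        dargs.bs = 8192;
        ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, &dargs);
        if (ret < 0)
                perror("BTRFS_DEDUP_CTL_SET_BS");

        close(fd);
        return ret < 0 ? 1 : 0;
}

Switching dedup off again is the same SET_BS call with bs = 0, and BTRFS_DEDUP_CTL_DISABLE removes the dedup tree once it is off.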
This adds deduplication subcommands, ''btrfs dedup command <path>'', including enable/disable/on/off. - btrfs dedup enable Create the dedup tree, and it''s the very first step when you''re going to use the dedup feature. - btrfs dedup disable Delete the dedup tree, after this we''re not able to use dedup any more unless you enable it again. - btrfs dedup on [-b] Switch on the dedup feature temporarily, and it''s the second step of applying dedup with writes. Option ''-b'' is used to set dedup blocksize. The default blocksize is 8192(no special reason, you may argue), and the current limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the upper limit of btrfs''s compression. - btrfs dedup off Switch off the dedup feature temporarily, but the dedup tree remains. --------------------------------------------------------- Usage: Step 1: btrfs dedup enable /btrfs Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs Step 3: now we have dedup, run your test. Step 4: btrfs dedup off /btrfs Step 5: btrfs dedup disable /btrfs --------------------------------------------------------- Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- v3: add commands ''btrfs dedup on/off'' v2: add manpage Makefile | 2 +- btrfs.c | 1 + cmds-dedup.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ commands.h | 2 + ctree.h | 2 + ioctl.h | 11 ++++ man/btrfs.8.in | 24 ++++++++ 7 files changed, 218 insertions(+), 1 deletions(-) create mode 100644 cmds-dedup.c diff --git a/Makefile b/Makefile index da7438e..5b4a07d 100644 --- a/Makefile +++ b/Makefile @@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ - cmds-restore.o + cmds-restore.o cmds-dedup.o libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \ crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \ diff --git a/btrfs.c b/btrfs.c index 691adef..956905c 100644 --- a/btrfs.c +++ b/btrfs.c @@ -254,6 +254,7 @@ const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, "a_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, { 0, 0, 0, 0, 0 } diff --git a/cmds-dedup.c b/cmds-dedup.c new file mode 100644 index 0000000..38b6841 --- /dev/null +++ b/cmds-dedup.c @@ -0,0 +1,177 @@ +/* + * Copyright (C) 2013 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. 
+ */ + +#include <sys/ioctl.h> +#include <unistd.h> +#include <getopt.h> + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" + +static const char * const dedup_cmd_group_usage[] = { + "btrfs dedup <command> [options] <path>", + NULL +}; + +int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args) +{ + int ret = 0; + int fd; + int e; + + fd = open_file_or_dir(path); + if (fd < 0) { + fprintf(stderr, "ERROR: can''t access ''%s''\n", path); + return -EACCES; + } + + ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args); + e = errno; + close(fd); + if (ret < 0) { + fprintf(stderr, "ERROR: dedup command failed: %s\n", + strerror(e)); + if (args->cmd == BTRFS_DEDUP_CTL_DISABLE || + args->cmd == BTRFS_DEDUP_CTL_SET_BS) + fprintf(stderr, "please refer to ''dmesg | tail'' for more info\n"); + return -EINVAL; + } + return 0; +} + +static const char * const cmd_dedup_enable_usage[] = { + "btrfs dedup enable <path>", + "Enable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_enable(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_enable_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_ENABLE; + + return dedup_ctl(argv[1], &dargs); +} + +static const char * const cmd_dedup_disable_usage[] = { + "btrfs dedup disable <path>", + "Disable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_disable(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_disable_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_DISABLE; + + return dedup_ctl(argv[1], &dargs); +} + +static int dedup_set_bs(char *path, struct btrfs_ioctl_dedup_args *dargs) +{ + return dedup_ctl(path, dargs); +} + +static const char * const cmd_dedup_on_usage[] = { + "btrfs dedup on [-b|--bs size] <path>", + "Switch on data deduplication or change the dedup blocksize.", + "", + "-b|--bs <size> set dedup blocksize", + NULL +}; + +static struct option longopts[] = { + {"bs", required_argument, NULL, ''b''}, + {0, 0, 0, 0} +}; + +static int cmd_dedup_on(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + u64 bs = 8192; + + optind = 1; + while (1) { + int longindex; + + int c = getopt_long(argc, argv, "b:", longopts, &longindex); + if (c < 0) + break; + + switch (c) { + case ''b'': + bs = parse_size(optarg); + break; + default: + usage(cmd_dedup_on_usage); + } + } + + if (check_argc_exact(argc - optind, 1)) + usage(cmd_dedup_on_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_SET_BS; + dargs.bs = bs; + + return dedup_set_bs(argv[optind], &dargs); +} + +static const char * const cmd_dedup_off_usage[] = { + "btrfs dedup off <path>", + "Switch off data deduplication.", + NULL +}; + +static int cmd_dedup_off(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_off_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_SET_BS; + dargs.bs = 0; + + return dedup_set_bs(argv[1], &dargs); +} + +const struct cmd_group dedup_cmd_group = { + dedup_cmd_group_usage, NULL, { + { "enable", cmd_dedup_enable, cmd_dedup_enable_usage, NULL, 0 }, + { "disable", cmd_dedup_disable, cmd_dedup_disable_usage, 0, 0 }, + { "on", cmd_dedup_on, cmd_dedup_on_usage, NULL, 0}, + { "off", cmd_dedup_off, cmd_dedup_off_usage, NULL, 0}, + { 0, 0, 0, 0, 0 } + } +}; + +int cmd_dedup(int argc, char **argv) +{ + return handle_command_group(&dedup_cmd_group, argc, argv); +} diff --git a/commands.h b/commands.h index 15c616d..d31afa4 
100644 --- a/commands.h +++ b/commands.h @@ -90,6 +90,7 @@ extern const struct cmd_group receive_cmd_group; extern const struct cmd_group quota_cmd_group; extern const struct cmd_group qgroup_cmd_group; extern const struct cmd_group replace_cmd_group; +extern const struct cmd_group dedup_cmd_group; extern const char * const cmd_send_usage[]; extern const char * const cmd_receive_usage[]; @@ -112,6 +113,7 @@ int cmd_restore(int argc, char **argv); int cmd_select_super(int argc, char **argv); int cmd_dump_super(int argc, char **argv); int cmd_debug_tree(int argc, char **argv); +int cmd_dedup(int argc, char **argv); /* subvolume exported functions */ int test_issubvolume(char *path); diff --git a/ctree.h b/ctree.h index 6f086bf..e86ff96 100644 --- a/ctree.h +++ b/ctree.h @@ -467,6 +467,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL @@ -479,6 +480,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) /* diff --git a/ioctl.h b/ioctl.h index abe6dd4..dc7b6a2 100644 --- a/ioctl.h +++ b/ioctl.h @@ -417,6 +417,15 @@ struct btrfs_ioctl_get_dev_stats { __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_ENABLE 1 +#define BTRFS_DEDUP_CTL_DISABLE 2 +#define BTRFS_DEDUP_CTL_SET_BS 3 +struct btrfs_ioctl_dedup_args { + __u64 cmd; + __u64 bs; +}; + /* BTRFS_IOC_SNAP_CREATE is no longer used by the btrfs command */ #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 @@ -537,6 +546,8 @@ struct btrfs_ioctl_clone_range_args { struct btrfs_ioctl_get_dev_stats) #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \ struct btrfs_ioctl_dev_replace_args) +#define BTRFS_IOC_DEDUP_CTL _IOWR(BTRFS_IOCTL_MAGIC, 55, \ + struct btrfs_ioctl_dedup_args) #ifdef __cplusplus } diff --git a/man/btrfs.8.in b/man/btrfs.8.in index af7df4d..887a81e 100644 --- a/man/btrfs.8.in +++ b/man/btrfs.8.in @@ -74,6 +74,14 @@ btrfs \- control a btrfs filesystem .PP \fBbtrfs\fP \fBqgroup limit\fP [options] \fI<size>\fP|\fBnone\fP [\fI<qgroupid>\fP] \fI<path>\fP .PP +\fBbtrfs\fP \fBdedup enable <path> \fP\fI\fP +.PP +\fBbtrfs\fP \fBdedup disable <path> \fP\fI\fP +.PP +\fBbtrfs\fP \fBdedup on [-b|--bs size] <path> \fP\fI\fP +.PP +\fBbtrfs\fP \fBdedup off <path> \fP\fI\fP +.PP \fBbtrfs\fP \fBhelp|\-\-help \fP\fI\fP .PP \fBbtrfs\fP \fB<command> \-\-help \fP\fI\fP @@ -489,6 +497,22 @@ Show all subvolume quota groups. \fBbtrfs\fP \fBqgroup limit\fP [options] \fI<size>\fP|\fBnone\fP [\fI<qgroupid>\fP] \fI<path>\fP Limit the size of a subvolume quota group. +.TP + +\fBbtrfs dedup enable\fP \fI<path>\fP +Enable data deduplication support for a filesystem. +.TP + +\fBbtrfs dedup disable\fP \fI<path>\fP +Disable data deduplication support for a filesystem. +.TP + +\fBbtrfs dedup on [-b|--bs size]\fP \fI<path>\fP +Switch on data deduplication or change the dedup blocksize. +.TP + +\fBbtrfs dedup off\fP \fI<path>\fP +Switch off data deduplication. 
.RE .SH EXIT STATUS -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Aurelien Jarno
2013-Oct-22 18:55 UTC
Re: [RFC PATCH v7 00/13] Online(inband) data deduplication
Hi, On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote:> Data deduplication is a specialized data compression technique for eliminating > duplicate copies of repeating data.[1] > > This patch set is also related to "Content based storage" in project ideas[2]. > > PATCH 1 is a hang fix with deduplication on, but it''s also useful without > dedup in practice use. > > PATCH 2 and 3 are targetting delayed refs'' scalability problems, which are > uncovered by the dedup feature. > > PATCH 4 is a speed-up improvement, which is about dedup and quota. > > PATCH 5-8 is the preparation for dedup implementation. > > PATCH 9 shows how we implement dedup feature. > > PATCH 10 fixes a backref walking bug with dedup. > > PATCH 11 fixes a free space bug of dedup extents on error handling. > > PATCH 12 fixes a race bug on dedup writes. > > PATCH 13 adds the ioctl to control dedup feature. > > And there is also a btrfs-progs patch(PATCH 14) which involves all details of > how to control dedup feature. > > I''ve tested this with xfstests by adding a inline dedup ''enable & on'' in xfstests'' > mount and scratch_mount. > > TODO: > * a bit-to-bit comparison callback. > > All comments are welcome! >Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6 and it worked correctly, although I haven''t used it on production servers given the bit-to-bit comparison callback isn''t implemented yet. I have a few comments on the ioctl to control the dedup feature. Basically it is used to enable the deduplication, to switch it on or off and to select the blocksize. Couldn''t it be implemented as a mount option instead like for the other btrfs features? The dedup tree would be created the first time the mount option is activated, and the on/off would be controlled by the presence of the dedup mount flag. The blocksize could be specified with the value appended to the dedup option, for example dedup=8192. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
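Purely to make the suggestion concrete: a mount-option variant along the lines Aurelien describes might look roughly like the fragment below in btrfs_parse_options(). The Opt_dedup_bs token and the whole approach are hypothetical and are not part of this series; Liu Bo's reply further down explains why an ioctl was chosen instead.

        /* Hypothetical sketch only -- not code from this series. */
        /* in the mount-option tokens table: */
        { Opt_dedup_bs, "dedup=%s" },

        /* in the match_token() switch of btrfs_parse_options(): */
        case Opt_dedup_bs:
                num = match_strdup(&args[0]);
                if (num) {
                        info->dedup_bs = memparse(num, NULL);
                        kfree(num);
                        pr_info("btrfs: inband dedup on, blocksize %llu\n",
                                info->dedup_bs);
                }
                break;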
Josef Bacik
2013-Oct-22 20:58 UTC
Re: [PATCH v7 02/13] Btrfs: improve the delayed refs process in rm case
On Mon, Oct 14, 2013 at 12:59:44PM +0800, Liu Bo wrote:> While removing a file with dedup extents, we could have a great number of > delayed refs pending to process, and these refs refer to droping > a ref of the extent, which is of BTRFS_DROP_DELAYED_REF type. > > But in order to prevent an extent''s ref count from going down to zero when > there still are pending delayed refs, we first select those "adding a ref" > ones, which is of BTRFS_ADD_DELAYED_REF type. > > So in removing case, all of our delayed refs are of BTRFS_DROP_DELAYED_REF type, > but we have to walk all the refs issued to the extent to find any > BTRFS_ADD_DELAYED_REF types and end up there is no such thing, and then start > over again to find BTRFS_DROP_DELAYED_REF. > > This is really unnecessary, we can improve this by tracking how many > BTRFS_ADD_DELAYED_REF refs we have and search by the right type. > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > ---Reviewed-by: Josef Bacik <jbacik@fusionio.com> Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote:> Hi, > > On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote: > > Data deduplication is a specialized data compression technique for eliminating > > duplicate copies of repeating data.[1] > > > > This patch set is also related to "Content based storage" in project ideas[2]. > > > > PATCH 1 is a hang fix with deduplication on, but it''s also useful without > > dedup in practice use. > > > > PATCH 2 and 3 are targetting delayed refs'' scalability problems, which are > > uncovered by the dedup feature. > > > > PATCH 4 is a speed-up improvement, which is about dedup and quota. > > > > PATCH 5-8 is the preparation for dedup implementation. > > > > PATCH 9 shows how we implement dedup feature. > > > > PATCH 10 fixes a backref walking bug with dedup. > > > > PATCH 11 fixes a free space bug of dedup extents on error handling. > > > > PATCH 12 fixes a race bug on dedup writes. > > > > PATCH 13 adds the ioctl to control dedup feature. > > > > And there is also a btrfs-progs patch(PATCH 14) which involves all details of > > how to control dedup feature. > > > > I''ve tested this with xfstests by adding a inline dedup ''enable & on'' in xfstests'' > > mount and scratch_mount. > > > > TODO: > > * a bit-to-bit comparison callback. > > > > All comments are welcome! > > > > Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6 > and it worked correctly, although I haven''t used it on production > servers given the bit-to-bit comparison callback isn''t implemented yet.Many thanks for testing this! It''s not yet proper for production server use until we solve the metadata reservation problems(I''m working on it right now).> > I have a few comments on the ioctl to control the dedup feature. > Basically it is used to enable the deduplication, to switch it on or off > and to select the blocksize. Couldn''t it be implemented as a mount > option instead like for the other btrfs features? The dedup tree would > be created the first time the mount option is activated, and the on/off > would be controlled by the presence of the dedup mount flag. The > blocksize could be specified with the value appended to the dedup > option, for example dedup=8192.In the previous version patch set, actually I chose to use mount options to provide a flexible control of dedup, but as thread[1] shows, David thinked that mount options is too heavy to use as it cannot be removed once it''s merged. [1]: http://www.spinics.net/lists/linux-btrfs/msg27294.html -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
(Cced: David) On Wed, Oct 23, 2013 at 10:26:17AM +0800, Liu Bo wrote:> On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote: > > Hi, > > > > On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote: > > > Data deduplication is a specialized data compression technique for eliminating > > > duplicate copies of repeating data.[1] > > > > > > This patch set is also related to "Content based storage" in project ideas[2]. > > > > > > PATCH 1 is a hang fix with deduplication on, but it''s also useful without > > > dedup in practice use. > > > > > > PATCH 2 and 3 are targetting delayed refs'' scalability problems, which are > > > uncovered by the dedup feature. > > > > > > PATCH 4 is a speed-up improvement, which is about dedup and quota. > > > > > > PATCH 5-8 is the preparation for dedup implementation. > > > > > > PATCH 9 shows how we implement dedup feature. > > > > > > PATCH 10 fixes a backref walking bug with dedup. > > > > > > PATCH 11 fixes a free space bug of dedup extents on error handling. > > > > > > PATCH 12 fixes a race bug on dedup writes. > > > > > > PATCH 13 adds the ioctl to control dedup feature. > > > > > > And there is also a btrfs-progs patch(PATCH 14) which involves all details of > > > how to control dedup feature. > > > > > > I''ve tested this with xfstests by adding a inline dedup ''enable & on'' in xfstests'' > > > mount and scratch_mount. > > > > > > TODO: > > > * a bit-to-bit comparison callback. > > > > > > All comments are welcome! > > > > > > > Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6 > > and it worked correctly, although I haven''t used it on production > > servers given the bit-to-bit comparison callback isn''t implemented yet. > > Many thanks for testing this! > > It''s not yet proper for production server use until we solve the > metadata reservation problems(I''m working on it right now). > > > > > I have a few comments on the ioctl to control the dedup feature. > > Basically it is used to enable the deduplication, to switch it on or off > > and to select the blocksize. Couldn''t it be implemented as a mount > > option instead like for the other btrfs features? The dedup tree would > > be created the first time the mount option is activated, and the on/off > > would be controlled by the presence of the dedup mount flag. The > > blocksize could be specified with the value appended to the dedup > > option, for example dedup=8192. > > In the previous version patch set, actually I chose to use mount options > to provide a flexible control of dedup, but as thread[1] shows, David > thinked that mount options is too heavy to use as it cannot be removed > once it''s merged. > > [1]: http://www.spinics.net/lists/linux-btrfs/msg27294.html > > -liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Alex Lyakas
2013-Dec-03 17:13 UTC
Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0
Hi Liu, Jan, What happens to "struct qgroup_update"s sitting in trans->qgroup_ref_list in case the transaction aborts? It seems that they are not freed. For example, if we are in btrfs_commit_transaction() and: - call create_pending_snapshots() - call btrfs_run_delayed_items() and this fails So we go to cleanup_transaction() without calling btrfs_delayed_refs_qgroup_accounting(), which would have been called by btrfs_run_delayed_refs(). I receive kmemleak warnings about these thingies not being freed, although on an older kernel. However, looking at Josef''s tree, this still seems to be the case. Thanks, Alex. On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo <bo.li.liu@oracle.com> wrote:> It''s unnecessary to do qgroups accounting without enabling quota. > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > --- > fs/btrfs/ctree.c | 2 +- > fs/btrfs/delayed-ref.c | 18 ++++++++++++++---- > fs/btrfs/qgroup.c | 3 +++ > 3 files changed, 18 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 61b5bcd..fb89235 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info, > > tree_mod_log_write_lock(fs_info); > spin_lock(&fs_info->tree_mod_seq_lock); > - if (!elem->seq) { > + if (elem && !elem->seq) { > elem->seq = btrfs_inc_tree_mod_seq_major(fs_info); > list_add_tail(&elem->list, &fs_info->tree_mod_seq_list); > } > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 9e1a1c9..3ec3d08 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info, > ref->is_head = 0; > ref->in_tree = 1; > > - if (need_ref_seq(for_cow, ref_root)) > - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > + if (need_ref_seq(for_cow, ref_root)) { > + struct seq_list *elem = NULL; > + > + if (fs_info->quota_enabled) > + elem = &trans->delayed_ref_elem; > + seq = btrfs_get_tree_mod_seq(fs_info, elem); > + } > ref->seq = seq; > > full_ref = btrfs_delayed_node_to_tree_ref(ref); > @@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info, > ref->is_head = 0; > ref->in_tree = 1; > > - if (need_ref_seq(for_cow, ref_root)) > - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > + if (need_ref_seq(for_cow, ref_root)) { > + struct seq_list *elem = NULL; > + > + if (fs_info->quota_enabled) > + elem = &trans->delayed_ref_elem; > + seq = btrfs_get_tree_mod_seq(fs_info, elem); > + } > ref->seq = seq; > > full_ref = btrfs_delayed_node_to_data_ref(ref); > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index 4e6ef49..1cb58f9 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, > { > struct qgroup_update *u; > > + if (!trans->root->fs_info->quota_enabled) > + return 0; > + > BUG_ON(!trans->delayed_ref_elem.seq); > u = kmalloc(sizeof(*u), GFP_NOFS); > if (!u) > -- > 1.7.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
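For reference, the cleanup being asked about could be as small as the sketch below, called from the abort path (for example cleanup_transaction()) before the handle is dropped. This is a hypothetical illustration, not a patch from this thread; it assumes the struct qgroup_update entries on trans->qgroup_ref_list only need to be unlinked and kfree()d once the transaction is known to be aborted, since the delayed refs they point to are torn down separately.

/* Hypothetical sketch, not a patch from this thread. */
static void free_pending_qgroup_updates(struct btrfs_trans_handle *trans)
{
        struct qgroup_update *u, *tmp;

        /* Drop the per-transaction qgroup records that
         * btrfs_delayed_refs_qgroup_accounting() never got to consume. */
        list_for_each_entry_safe(u, tmp, &trans->qgroup_ref_list, list) {
                list_del(&u->list);
                kfree(u);
        }
}

Whether the tree-mod-log sequence element taken alongside these records also has to be dropped on abort is the subject of the follow-up mail below.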
Alex Lyakas
2013-Dec-03 18:58 UTC
Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0
And same is also true for "struct tree_mod_elem" sitting in "fs_info->tree_mod_log". Usually these are also freed up by btrfs_delayed_refs_qgroup_accounting() calling btrfs_put_tree_mod_seq(). But not in case of transaction abort. Some sample kmemleak stacks below: unreferenced object 0xffff8800365605a0 (size 128): comm "btrfs", pid 21162, jiffies 4295508540 (age 516.080s) hex dump (first 32 bytes): 81 04 56 36 00 88 ff ff 00 00 00 00 00 00 00 00 ..V6............ 00 00 00 00 00 00 00 00 31 05 00 00 00 00 00 00 ........1....... backtrace: [<ffffffff816798e6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117dd5b>] kmem_cache_alloc_trace+0xab/0x160 [<ffffffffa031754f>] tree_mod_log_insert_key_locked+0x4f/0x140 [btrfs] [<ffffffffa0317702>] tree_mod_log_free_eb+0xc2/0xf0 [btrfs] [<ffffffffa031a1c6>] __btrfs_cow_block+0x316/0x520 [btrfs] [<ffffffffa031a57e>] btrfs_cow_block+0x12e/0x1f0 [btrfs] [<ffffffffa031ebd1>] btrfs_search_slot+0x381/0x830 [btrfs] [<ffffffffa0320c8c>] btrfs_insert_empty_items+0x7c/0x110 [btrfs] [<ffffffffa033d263>] insert_with_overflow+0x43/0x170 [btrfs] [<ffffffffa033d44f>] btrfs_insert_dir_item+0xbf/0x200 [btrfs] [<ffffffffa034a305>] create_pending_snapshot+0xbf5/0xd70 [btrfs] [<ffffffffa034a5e9>] create_pending_snapshots+0x169/0x240 [btrfs] [<ffffffffa034ba4a>] btrfs_commit_transaction+0x4aa/0x1080 [btrfs] [<ffffffffa0382861>] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs] [<ffffffffa0382a5f>] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs] [<ffffffffa0382cc5>] btrfs_ioctl_snap_create_v2+0x135/0x190 [btrfs] unreferenced object 0xffff88007ab00c60 (size 128): comm "btrfs", pid 21162, jiffies 4295508540 (age 516.760s) hex dump (first 32 bytes): 10 0e b0 7a 00 88 ff ff 00 00 00 00 00 00 00 00 ...z............ 00 00 00 00 00 00 00 00 b7 05 00 00 00 00 00 00 ................ backtrace: [<ffffffff816798e6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117dd5b>] kmem_cache_alloc_trace+0xab/0x160 [<ffffffffa03172e4>] tree_mod_log_insert_key_mask.isra.33+0xb4/0x1c0 [btrfs] [<ffffffffa03174fe>] tree_mod_log_insert_key+0xe/0x10 [btrfs] [<ffffffffa031a16f>] __btrfs_cow_block+0x2bf/0x520 [btrfs] [<ffffffffa031a57e>] btrfs_cow_block+0x12e/0x1f0 [btrfs] [<ffffffffa031dc43>] push_leaf_right+0x133/0x1a0 [btrfs] [<ffffffffa031e381>] split_leaf+0x5e1/0x770 [btrfs] [<ffffffffa031efd0>] btrfs_search_slot+0x780/0x830 [btrfs] [<ffffffffa0320c8c>] btrfs_insert_empty_items+0x7c/0x110 [btrfs] [<ffffffffa033d263>] insert_with_overflow+0x43/0x170 [btrfs] [<ffffffffa033d44f>] btrfs_insert_dir_item+0xbf/0x200 [btrfs] [<ffffffffa034a305>] create_pending_snapshot+0xbf5/0xd70 [btrfs] [<ffffffffa034a5e9>] create_pending_snapshots+0x169/0x240 [btrfs] [<ffffffffa034ba4a>] btrfs_commit_transaction+0x4aa/0x1080 [btrfs] [<ffffffffa0382861>] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs] unreferenced object 0xffff88007b024780 (size 32): comm "btrfs", pid 26945, jiffies 4295316905 (age 767.060s) hex dump (first 32 bytes): 20 47 02 7b 00 88 ff ff 00 4f 02 7b 00 88 ff ff G.{.....O.{.... 90 86 b6 6a 00 88 ff ff 00 00 00 00 00 00 00 00 ...j............ 
backtrace: [<ffffffff816798e6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117dd5b>] kmem_cache_alloc_trace+0xab/0x160 [<ffffffffa03ae944>] btrfs_qgroup_record_ref+0x44/0xd0 [btrfs] [<ffffffffa039c861>] btrfs_add_delayed_tree_ref+0x141/0x1f0 [btrfs] [<ffffffffa032f5cd>] btrfs_free_tree_block+0x9d/0x220 [btrfs] [<ffffffffa031b325>] __btrfs_cow_block+0x475/0x520 [btrfs] [<ffffffffa031b57e>] btrfs_cow_block+0x12e/0x1f0 [btrfs] [<ffffffffa031fbd1>] btrfs_search_slot+0x381/0x830 [btrfs] [<ffffffffa0321c8c>] btrfs_insert_empty_items+0x7c/0x110 [btrfs] [<ffffffffa033e253>] insert_with_overflow+0x43/0x170 [btrfs] [<ffffffffa033e43f>] btrfs_insert_dir_item+0xbf/0x200 [btrfs] [<ffffffffa034b2f5>] create_pending_snapshot+0xbf5/0xd50 [btrfs] [<ffffffffa034b551>] create_pending_snapshots+0x101/0x1d0 [btrfs] [<ffffffffa034c9aa>] btrfs_commit_transaction+0x4aa/0x1080 [btrfs] [<ffffffffa03837c1>] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs] [<ffffffffa03839bf>] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs] unreferenced object 0xffff88007b024210 (size 32): comm "btrfs", pid 26945, jiffies 4295316907 (age 767.068s) hex dump (first 32 bytes): e0 44 02 7b 00 88 ff ff 60 46 02 7b 00 88 ff ff .D.{....`F.{.... 50 e1 41 6a 00 88 ff ff 70 68 96 6b 00 88 ff ff P.Aj....ph.k.... backtrace: [<ffffffff816798e6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117dd5b>] kmem_cache_alloc_trace+0xab/0x160 [<ffffffffa03ae944>] btrfs_qgroup_record_ref+0x44/0xd0 [btrfs] [<ffffffffa039c861>] btrfs_add_delayed_tree_ref+0x141/0x1f0 [btrfs] [<ffffffffa03312c4>] btrfs_alloc_free_block+0x1a4/0x450 [btrfs] [<ffffffffa031afe8>] __btrfs_cow_block+0x138/0x520 [btrfs] [<ffffffffa031b57e>] btrfs_cow_block+0x12e/0x1f0 [btrfs] [<ffffffffa031fbd1>] btrfs_search_slot+0x381/0x830 [btrfs] [<ffffffffa0321c8c>] btrfs_insert_empty_items+0x7c/0x110 [btrfs] [<ffffffffa033e253>] insert_with_overflow+0x43/0x170 [btrfs] [<ffffffffa033e43f>] btrfs_insert_dir_item+0xbf/0x200 [btrfs] [<ffffffffa034b2f5>] create_pending_snapshot+0xbf5/0xd50 [btrfs] [<ffffffffa034b551>] create_pending_snapshots+0x101/0x1d0 [btrfs] [<ffffffffa034c9aa>] btrfs_commit_transaction+0x4aa/0x1080 [btrfs] [<ffffffffa03837c1>] btrfs_mksubvol.isra.51+0x501/0x5f0 [btrfs] [<ffffffffa03839bf>] btrfs_ioctl_snap_create_transid+0x10f/0x1b0 [btrfs] On Tue, Dec 3, 2013 at 7:13 PM, Alex Lyakas <alex.btrfs@zadarastorage.com> wrote: > Hi Liu, Jan, > > What happens to "struct qgroup_update"s sitting in > trans->qgroup_ref_list in case the transaction aborts? It seems that > they are not freed. > > For example, if we are in btrfs_commit_transaction() and: > - call create_pending_snapshots() > - call btrfs_run_delayed_items() and this fails > So we go to cleanup_transaction() without calling > btrfs_delayed_refs_qgroup_accounting(), which would have been called > by btrfs_run_delayed_refs(). > > I receive kmemleak warnings about these thingies not being freed, > although on an older kernel. However, looking at Josef's tree, this > still seems to be the case. > > Thanks, > Alex. > > > On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo <bo.li.liu@oracle.com> wrote: >> It's unnecessary to do qgroups accounting without enabling quota.
>> >> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> >> --- >> fs/btrfs/ctree.c | 2 +- >> fs/btrfs/delayed-ref.c | 18 ++++++++++++++---- >> fs/btrfs/qgroup.c | 3 +++ >> 3 files changed, 18 insertions(+), 5 deletions(-) >> >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c >> index 61b5bcd..fb89235 100644 >> --- a/fs/btrfs/ctree.c >> +++ b/fs/btrfs/ctree.c >> @@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info, >> >> tree_mod_log_write_lock(fs_info); >> spin_lock(&fs_info->tree_mod_seq_lock); >> - if (!elem->seq) { >> + if (elem && !elem->seq) { >> elem->seq = btrfs_inc_tree_mod_seq_major(fs_info); >> list_add_tail(&elem->list, &fs_info->tree_mod_seq_list); >> } >> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c >> index 9e1a1c9..3ec3d08 100644 >> --- a/fs/btrfs/delayed-ref.c >> +++ b/fs/btrfs/delayed-ref.c >> @@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info, >> ref->is_head = 0; >> ref->in_tree = 1; >> >> - if (need_ref_seq(for_cow, ref_root)) >> - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); >> + if (need_ref_seq(for_cow, ref_root)) { >> + struct seq_list *elem = NULL; >> + >> + if (fs_info->quota_enabled) >> + elem = &trans->delayed_ref_elem; >> + seq = btrfs_get_tree_mod_seq(fs_info, elem); >> + } >> ref->seq = seq; >> >> full_ref = btrfs_delayed_node_to_tree_ref(ref); >> @@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info, >> ref->is_head = 0; >> ref->in_tree = 1; >> >> - if (need_ref_seq(for_cow, ref_root)) >> - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); >> + if (need_ref_seq(for_cow, ref_root)) { >> + struct seq_list *elem = NULL; >> + >> + if (fs_info->quota_enabled) >> + elem = &trans->delayed_ref_elem; >> + seq = btrfs_get_tree_mod_seq(fs_info, elem); >> + } >> ref->seq = seq; >> >> full_ref = btrfs_delayed_node_to_data_ref(ref); >> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c >> index 4e6ef49..1cb58f9 100644 >> --- a/fs/btrfs/qgroup.c >> +++ b/fs/btrfs/qgroup.c >> @@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, >> { >> struct qgroup_update *u; >> >> + if (!trans->root->fs_info->quota_enabled) >> + return 0; >> + >> BUG_ON(!trans->delayed_ref_elem.seq); >> u = kmalloc(sizeof(*u), GFP_NOFS); >> if (!u) >> -- >> 1.7.7
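On the success path these tree_mod_elem entries go away when btrfs_delayed_refs_qgroup_accounting() drops the transaction's sequence reference via btrfs_put_tree_mod_seq(), as noted above. A sketch of doing the same thing from the abort path, with the placement (for example from cleanup_transaction()) assumed rather than taken from any posted patch, might be:

/*
 * Sketch only: release the transaction's tree-mod-log reference on abort,
 * mirroring what btrfs_delayed_refs_qgroup_accounting() does on a
 * successful commit, so the tree_mod_elem entries covered by this
 * sequence number can be freed instead of lingering in
 * fs_info->tree_mod_log.
 */
static void put_tree_mod_seq_on_abort(struct btrfs_fs_info *fs_info,
				      struct btrfs_trans_handle *trans)
{
	if (trans->delayed_ref_elem.seq)
		btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
}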
Liu Bo
2013-Dec-13 05:42 UTC
Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0
On Tue, Dec 03, 2013 at 07:13:48PM +0200, Alex Lyakas wrote: > Hi Liu, Jan, > > What happens to "struct qgroup_update"s sitting in > trans->qgroup_ref_list in case the transaction aborts? It seems that > they are not freed. > > For example, if we are in btrfs_commit_transaction() and: > - call create_pending_snapshots() > - call btrfs_run_delayed_items() and this fails > So we go to cleanup_transaction() without calling > btrfs_delayed_refs_qgroup_accounting(), which would have been called > by btrfs_run_delayed_refs(). > > I receive kmemleak warnings about these thingies not being freed, > although on an older kernel. However, looking at Josef's tree, this > still seems to be the case. I think you're right, but I'm sure because I cannot reproduce that somehow, so I suggest adding an assert_qgroups_uptodate() call in cleanup_transaction() to catch that leak if there is one. thanks, -liubo > > Thanks, > Alex. > > > On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > > It's unnecessary to do qgroups accounting without enabling quota. > > > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > > --- > > fs/btrfs/ctree.c | 2 +- > > fs/btrfs/delayed-ref.c | 18 ++++++++++++++---- > > fs/btrfs/qgroup.c | 3 +++ > > 3 files changed, 18 insertions(+), 5 deletions(-) > > > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > > index 61b5bcd..fb89235 100644 > > --- a/fs/btrfs/ctree.c > > +++ b/fs/btrfs/ctree.c > > @@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info, > > > > tree_mod_log_write_lock(fs_info); > > spin_lock(&fs_info->tree_mod_seq_lock); > > - if (!elem->seq) { > > + if (elem && !elem->seq) { > > elem->seq = btrfs_inc_tree_mod_seq_major(fs_info); > > list_add_tail(&elem->list, &fs_info->tree_mod_seq_list); > > } > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > > index 9e1a1c9..3ec3d08 100644 > > --- a/fs/btrfs/delayed-ref.c > > +++ b/fs/btrfs/delayed-ref.c > > @@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info, > > ref->is_head = 0; > > ref->in_tree = 1; > > > > - if (need_ref_seq(for_cow, ref_root)) > > - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > > + if (need_ref_seq(for_cow, ref_root)) { > > + struct seq_list *elem = NULL; > > + > > + if (fs_info->quota_enabled) > > + elem = &trans->delayed_ref_elem; > > + seq = btrfs_get_tree_mod_seq(fs_info, elem); > > + } > > ref->seq = seq; > > > > full_ref = btrfs_delayed_node_to_tree_ref(ref); > > @@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info, > > ref->is_head = 0; > > ref->in_tree = 1; > > > > - if (need_ref_seq(for_cow, ref_root)) > > - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > > + if (need_ref_seq(for_cow, ref_root)) { > > + struct seq_list *elem = NULL; > > + > > + if (fs_info->quota_enabled) > > + elem = &trans->delayed_ref_elem; > > + seq = btrfs_get_tree_mod_seq(fs_info, elem); > > + } > > ref->seq = seq; > > > > full_ref = btrfs_delayed_node_to_data_ref(ref); > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > > index 4e6ef49..1cb58f9 100644 > > --- a/fs/btrfs/qgroup.c > > +++ b/fs/btrfs/qgroup.c > > @@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, > > { > > struct qgroup_update *u; > > > > + if (!trans->root->fs_info->quota_enabled) > > + return 0; > > + > > BUG_ON(!trans->delayed_ref_elem.seq); > > u = kmalloc(sizeof(*u), GFP_NOFS); > > if (!u) > > -- > > 1.7.7
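assert_qgroups_uptodate() is the existing sanity check in fs/btrfs/transaction.c that complains when trans->qgroup_ref_list is not empty or trans->delayed_ref_elem.seq is still set; on the paths this thread discusses it only runs when the transaction goes through normally. A sketch of the suggestion above, with the exact placement inside cleanup_transaction() assumed, would be as small as:

/*
 * Sketch: run the same qgroup sanity check on the abort path that the
 * normal commit path runs, so a transaction torn down with queued
 * qgroup updates trips the assertion instead of leaking them silently.
 */
static void cleanup_transaction(struct btrfs_trans_handle *trans,
				struct btrfs_root *root, int err)
{
	/* ... existing abort/cleanup work ... */

	assert_qgroups_uptodate(trans);
}

Note this only turns the silent leak into a loud warning; the entries on trans->qgroup_ref_list would still need to be freed, for instance along the lines of the helper sketched earlier in the thread.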
Liu Bo
2013-Dec-13 06:04 UTC
Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0
On Fri, Dec 13, 2013 at 01:42:57PM +0800, Liu Bo wrote: > On Tue, Dec 03, 2013 at 07:13:48PM +0200, Alex Lyakas wrote: > > Hi Liu, Jan, > > > > What happens to "struct qgroup_update"s sitting in > > trans->qgroup_ref_list in case the transaction aborts? It seems that > > they are not freed. > > > > For example, if we are in btrfs_commit_transaction() and: > > - call create_pending_snapshots() > > - call btrfs_run_delayed_items() and this fails > > So we go to cleanup_transaction() without calling > > btrfs_delayed_refs_qgroup_accounting(), which would have been called > > by btrfs_run_delayed_refs(). > > > > I receive kmemleak warnings about these thingies not being freed, > > although on an older kernel. However, looking at Josef's tree, this > > still seems to be the case. > > I think you're right, but I'm sure because I cannot reproduce that somehow, Sorry, s/'sure'/'not sure'/g -liubo