Stefan Behrens
2013-May-14 09:36 UTC
[PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means, that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started. So the issue to address is that Btrfs send / receive does not work as it is today when a high number of subvolumes exist. It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool. Therefore kernel code is added that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve the subvol ID. Now follows the lengthy justification, why a new tree was added instead of using the existing root tree: The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge. To use any other existing tree did not look proper. To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That''s the justification why a completely new tree was introduced. v1 -> v2: - All review comments from David Sterba, Josef Bacik and Jan Schmidt are addressed. The hugest change was to add a mechanism that handles the case that the filesystem is mounted with an older kernel. Now that case is detected when the filesystem is mounted with a newer kernel again, and the UUID tree is updated in the background. Stefan Behrens (8): Btrfs: introduce a tree for items that map UUIDs to something Btrfs: support printing UUID tree elements Btrfs: create UUID tree if required Btrfs: maintain subvolume items in the UUID tree Btrfs: fill UUID tree initially Btrfs: introduce uuid-tree-gen field Btrfs: check UUID tree during mount if required Btrfs: add mount option to force UUID tree checking fs/btrfs/Makefile | 3 +- fs/btrfs/ctree.h | 65 ++++- fs/btrfs/disk-io.c | 62 ++++- fs/btrfs/ioctl.c | 74 +++++- fs/btrfs/print-tree.c | 82 +++++++ fs/btrfs/super.c | 8 +- fs/btrfs/transaction.c | 21 +- fs/btrfs/uuid-tree.c | 637 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.c | 258 ++++++++++++++++++++ fs/btrfs/volumes.h | 2 + 10 files changed, 1197 insertions(+), 15 deletions(-) create mode 100644 fs/btrfs/uuid-tree.c -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 09:36 UTC
[PATCH v2 1/8] Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means, that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started. It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool. Therefore kernel code is added with this commit that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve data that is assigned to this UUID, like which subvolume ID is related to this UUID. This commit adds a new tree to hold UUID-to-data mapping items. The key of the items is the full UUID plus the key type BTRFS_UUID_KEY. Multiple data blocks can be stored for a given UUID, a type/length/ value scheme is used. Now follows the lengthy justification, why a new tree was added instead of using the existing root tree: The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge. To use any other existing tree did not look proper. To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That''s the justification why a completely new tree was introduced. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/Makefile | 3 +- fs/btrfs/ctree.h | 50 ++++++ fs/btrfs/uuid-tree.c | 481 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 533 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index 3932224..a550dfc 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ - reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o + reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ + uuid-tree.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f415377..ca9149a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -91,6 +91,9 @@ struct btrfs_ordered_sum; /* holds quota configuration and tracking */ #define BTRFS_QUOTA_TREE_OBJECTID 8ULL +/* for storing items that use the BTRFS_UUID_KEY */ +#define BTRFS_UUID_TREE_OBJECTID 9ULL + /* for storing balance parameters in the root tree */ #define BTRFS_BALANCE_OBJECTID -4ULL @@ -953,6 +956,18 @@ struct btrfs_dev_replace_item { __le64 num_uncorrectable_read_errors; } __attribute__ ((__packed__)); +/* for items that use the BTRFS_UUID_KEY */ +#define BTRFS_UUID_ITEM_TYPE_SUBVOL 0 /* for UUIDs assigned to subvols */ +#define BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL 1 /* for UUIDs assigned to + * received subvols */ + +/* a sequence of such items is stored under the BTRFS_UUID_KEY */ +struct btrfs_uuid_item { + __le64 type; /* refer to BTRFS_UUID_ITEM_TYPE* defines above */ + __le64 len; /* number of following 64bit values */ + __le64 subid[0]; /* sequence of subids */ +} __attribute__ ((__packed__)); + /* different types of block groups (and chunks) */ #define BTRFS_BLOCK_GROUP_DATA (1ULL << 0) #define BTRFS_BLOCK_GROUP_SYSTEM (1ULL << 1) @@ -1901,6 +1916,17 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_DEV_REPLACE_KEY 250 /* + * Stores items that allow to quickly map UUIDs to something else. + * These items are part of the filesystem UUID tree. + * The key is built like this: + * (UUID_upper_64_bits, BTRFS_UUID_KEY, UUID_lower_64_bits). + */ +#if BTRFS_UUID_SIZE != 16 +#error "UUID items require BTRFS_UUID_SIZE == 16!" +#endif +#define BTRFS_UUID_KEY 251 + +/* * string items are for debugging. They just store a short string of * data in the FS */ @@ -2956,6 +2982,12 @@ BTRFS_SETGET_FUNCS(dev_replace_cursor_left, struct btrfs_dev_replace_item, BTRFS_SETGET_FUNCS(dev_replace_cursor_right, struct btrfs_dev_replace_item, cursor_right, 64); +/* btrfs_uuid_item */ +BTRFS_SETGET_FUNCS(uuid_type, struct btrfs_uuid_item, type, 64); +BTRFS_SETGET_FUNCS(uuid_len, struct btrfs_uuid_item, len, 64); +BTRFS_SETGET_STACK_FUNCS(stack_uuid_type, struct btrfs_uuid_item, type, 64); +BTRFS_SETGET_STACK_FUNCS(stack_uuid_len, struct btrfs_uuid_item, len, 64); + BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_src_devid, struct btrfs_dev_replace_item, src_devid, 64); BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cont_reading_from_srcdev_mode, @@ -3372,6 +3404,24 @@ void btrfs_check_and_init_root_item(struct btrfs_root_item *item); void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root); +/* uuid-tree.c */ +int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, + u64 *subvol_id); +int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id); +int btrfs_del_uuid_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id); +int btrfs_lookup_uuid_received_subvol_item(struct btrfs_root *uuid_root, + u8 *uuid, u64 *subvol_id); +int btrfs_insert_uuid_received_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, + u8 *uuid, u64 subvol_id); +int btrfs_del_uuid_received_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id); + /* dir-item.c */ int btrfs_check_dir_item_collision(struct btrfs_root *root, u64 dir, const char *name, int name_len); diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c new file mode 100644 index 0000000..a44e13f --- /dev/null +++ b/fs/btrfs/uuid-tree.c @@ -0,0 +1,481 @@ +/* + * Copyright (C) STRATO AG 2013. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ +#include <linux/uuid.h> +#include <asm/unaligned.h> +#include "ctree.h" +#include "transaction.h" +#include "disk-io.h" +#include "print-tree.h" + + +/* + * One key is used to store a sequence of btrfs_uuid_item items. + * Each item in the sequence contains a type information and a sequence of + * ids (together with the information about the size of the sequence of ids). + * {[btrfs_uuid_item type0 {id0, id1, ..., idN}], + * ..., + * [btrfs_uuid_item typeZ {id0, id1, ..., idN}]} + * + * It is forbidden to put multiple items with the same type under the same key. + * Instead the sequence of ids is extended and used to store any additional + * ids for the same item type. + */ + +static void btrfs_uuid_to_key(u8 *uuid, struct btrfs_key *key) +{ + key->type = BTRFS_UUID_KEY; + key->objectid = get_unaligned_le64(uuid); + key->offset = get_unaligned_le64(uuid + sizeof(u64)); +} + +static struct btrfs_uuid_item *btrfs_match_uuid_item_type( + struct btrfs_path *path, u64 type) +{ + struct extent_buffer *eb; + int slot; + struct btrfs_uuid_item *ptr; + u32 item_size; + + eb = path->nodes[0]; + slot = path->slots[0]; + ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item); + item_size = btrfs_item_size_nr(eb, slot); + do { + u64 sub_item_type; + u64 sub_item_len; + + if (item_size < sizeof(*ptr)) { + pr_warn("btrfs: uuid item too short (%llu < %d)!\n", + (unsigned long long)item_size, + (int)sizeof(*ptr)); + return NULL; + } + item_size -= sizeof(*ptr); + sub_item_type = btrfs_uuid_type(eb, ptr); + sub_item_len = btrfs_uuid_len(eb, ptr); + if (sub_item_len * sizeof(u64) > item_size) { + pr_warn("btrfs: uuid item too short (%llu > %llu)!\n", + (unsigned long long)(sub_item_len * + sizeof(u64)), + (unsigned long long)item_size); + return NULL; + } + if (sub_item_type == type) + return ptr; + item_size -= sub_item_len * sizeof(u64); + ptr = 1 + (struct btrfs_uuid_item *) + (((char *)ptr) + (sub_item_len * sizeof(u64))); + } while (item_size); + + return NULL; +} + +static int btrfs_uuid_tree_lookup_prepare(struct btrfs_root *uuid_root, + u8 *uuid, u64 type, + struct btrfs_path *path, + struct btrfs_uuid_item **ptr) +{ + int ret; + struct btrfs_key key; + + if (!uuid_root) { + WARN_ON_ONCE(1); + ret = -ENOENT; + goto out; + } + + btrfs_uuid_to_key(uuid, &key); + + ret = btrfs_search_slot(NULL, uuid_root, &key, path, 0, 0); + if (ret < 0) + goto out; + if (ret > 0) { + ret = -ENOENT; + goto out; + } + + *ptr = btrfs_match_uuid_item_type(path, type); + if (!*ptr) { + ret = -ENOENT; + goto out; + } + + ret = 0; + +out: + return ret; +} + +/* return -ENOENT for !found, < 0 for errors, or 0 if an item was found */ +static int btrfs_uuid_tree_lookup(struct btrfs_root *uuid_root, u8 *uuid, + u64 type, u64 subid) +{ + int ret; + struct btrfs_path *path; + struct extent_buffer *eb; + struct btrfs_uuid_item *ptr; + u64 sub_item_len; + unsigned long offset; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + ret = btrfs_uuid_tree_lookup_prepare(uuid_root, uuid, type, path, &ptr); + if (ret < 0) + goto out; + + eb = path->nodes[0]; + sub_item_len = btrfs_uuid_len(eb, ptr); + WARN_ON_ONCE(sub_item_len == 0); + ret = -ENOENT; + ptr++; + offset = (unsigned long)ptr; + while (sub_item_len > 0) { + u64 data; + + read_extent_buffer(eb, &data, offset, sizeof(data)); + data = le64_to_cpu(data); + if (data == subid) { + ret = 0; + break; + } + offset += sizeof(data); + sub_item_len--; + } + +out: + btrfs_free_path(path); + return ret; +} + +/* return -ENOENT for !found, < 0 for errors, or 0 if an item was found */ +static int btrfs_uuid_tree_lookup_any(struct btrfs_root *uuid_root, u8 *uuid, + u64 type, u64 *subid) +{ + int ret; + struct btrfs_path *path; + struct extent_buffer *eb; + struct btrfs_uuid_item *ptr; + u64 sub_item_len; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + ret = btrfs_uuid_tree_lookup_prepare(uuid_root, uuid, type, path, &ptr); + if (ret < 0) + goto out; + + eb = path->nodes[0]; + sub_item_len = btrfs_uuid_len(eb, ptr); + WARN_ON_ONCE(sub_item_len == 0); + if (sub_item_len > 0) { + /* return first stored id */ + read_extent_buffer(eb, subid, (unsigned long)(ptr + 1), + sizeof(*subid)); + *subid = le64_to_cpu(*subid); + ret = 0; + } else { + ret = -ENOENT; + } + +out: + btrfs_free_path(path); + return ret; +} + +/* it is not checked whether the entry to add already exists */ +static int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 type, u64 subid) +{ + int ret; + struct btrfs_path *path = NULL; + struct btrfs_key key; + struct extent_buffer *eb; + int slot; + struct btrfs_uuid_item *ptr; + u32 item_size; + + if (!uuid_root) { + WARN_ON_ONCE(1); + ret = -EINVAL; + goto out; + } + + btrfs_uuid_to_key(uuid, &key); + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + ret = btrfs_insert_empty_item(trans, uuid_root, path, &key, + sizeof(*ptr) + sizeof(subid)); + if (ret == -EEXIST) { + ptr = btrfs_match_uuid_item_type(path, type); + if (ptr) { + /* + * An item with that type already exists. + * Extend the item and store the subid at the + * location of the first id of this type. + */ + unsigned long start; + unsigned long move_dst; + unsigned long move_src; + unsigned long move_len; + + btrfs_extend_item(uuid_root, path, sizeof(subid)); + ptr = (struct btrfs_uuid_item *) + (((char *)ptr) - sizeof(subid)); + eb = path->nodes[0]; + slot = path->slots[0]; + item_size = btrfs_item_size_nr(eb, slot); + WARN_ON(btrfs_uuid_len(eb, ptr) == 0); + btrfs_set_uuid_len(eb, ptr, + btrfs_uuid_len(eb, ptr) + 1); + ptr++; + start = btrfs_item_ptr_offset(eb, slot); + move_dst = ((unsigned long)ptr) + sizeof(subid); + move_src = (unsigned long)ptr; + move_len = item_size - (move_dst - start); + memmove_extent_buffer(eb, move_dst, move_src, move_len); + + goto write_subid; + } else { + btrfs_extend_item(uuid_root, path, + sizeof(*ptr) + sizeof(subid)); + } + } else if (ret < 0) { + pr_warn("btrfs: insert uuid item failed %d (0x%016llx, 0x%016llx) type %llu!\n", + ret, (unsigned long long)key.objectid, + (unsigned long long)key.offset, + (unsigned long long)type); + goto out; + } + + /* + * Add an item for the type for the first time. Either add it behind + * items with different types, or write the very first item. + */ + eb = path->nodes[0]; + slot = path->slots[0]; + ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item); + item_size = btrfs_item_size_nr(eb, slot); + ptr = (struct btrfs_uuid_item *) + (((char *)ptr) + item_size - (sizeof(*ptr) + sizeof(subid))); + btrfs_set_uuid_type(eb, ptr, type); + btrfs_set_uuid_len(eb, ptr, 1); + ptr++; + +write_subid: + ret = 0; + subid = cpu_to_le64(subid); + write_extent_buffer(eb, &subid, (unsigned long)ptr, sizeof(subid)); + btrfs_mark_buffer_dirty(eb); + +out: + btrfs_free_path(path); + return ret; +} + +static int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 type, u64 subid) +{ + int ret; + struct btrfs_path *path = NULL; + struct btrfs_key key; + struct extent_buffer *eb; + int slot; + struct btrfs_uuid_item *ptr_start; + unsigned long offset; + u32 item_size; + u64 id_index; + unsigned long start; + unsigned long move_dst; + unsigned long move_src; + unsigned long move_len; + + if (!uuid_root) { + WARN_ON_ONCE(1); + ret = -EINVAL; + goto out; + } + + btrfs_uuid_to_key(uuid, &key); + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + ret = btrfs_search_slot(trans, uuid_root, &key, path, -1, 1); + if (ret < 0) { + pr_warn("btrfs: error %d while searching for uuid item!\n", + ret); + goto out; + } + if (ret > 0) { + ret = -ENOENT; + goto out; + } + + ptr_start = btrfs_match_uuid_item_type(path, type); + if (!ptr_start) { + /* case 1: no entry for that type is found */ + ret = -ENOENT; + goto out; + } + + eb = path->nodes[0]; + slot = path->slots[0]; + start = btrfs_item_ptr_offset(eb, slot); + item_size = btrfs_item_size_nr(eb, slot); + id_index = btrfs_uuid_len(eb, ptr_start); + offset = (unsigned long)(ptr_start + 1); + while (id_index > 0) { + u64 found_subid; + + read_extent_buffer(eb, &found_subid, offset, + sizeof(found_subid)); + found_subid = le64_to_cpu(found_subid); + if (found_subid == subid) + break; + offset += sizeof(found_subid); + id_index--; + } + + if (!id_index) { + /* + * case 2: an entry for that type is found, but no match + * for the id + */ + ret = -ENOENT; + goto out; + } + + if (btrfs_uuid_len(eb, ptr_start) == 1) { + if (item_size == sizeof(*ptr_start) + sizeof(subid)) { + /* + * case 3: no other types and ids are stored, + * delete the full item + */ + ret = btrfs_del_item(trans, uuid_root, path); + goto out; + } else { + /* + * case 4: No other ids for the particular type + * are stored, but other types exist. Shrink the + * item by the size of an id plus the size of the + * item for that type. + */ + move_dst = (unsigned long)ptr_start; + } + } else { + /* + * case 5: Other ids for that particular type are stored. + * Shrink the item by the size of an id. + */ + move_dst = offset; + WARN_ON(btrfs_uuid_len(eb, ptr_start) == 0); + btrfs_set_uuid_len(eb, ptr_start, + btrfs_uuid_len(eb, ptr_start) - 1); + } + move_src = offset + sizeof(subid); + move_len = item_size - (move_src - start); + + memmove_extent_buffer(eb, move_dst, move_src, move_len); + btrfs_truncate_item(uuid_root, path, item_size - (move_src - move_dst), + 1); + +out: + btrfs_free_path(path); + return ret; +} + +int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, + u64 *subvol_id) +{ + return btrfs_uuid_tree_lookup_any(uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_SUBVOL, + subvol_id); +} + +int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id) +{ + int ret; + + ret = btrfs_uuid_tree_lookup(uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_SUBVOL, subvol_id); + if (ret == -ENOENT) + ret = btrfs_uuid_tree_add(trans, uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_SUBVOL, + subvol_id); + return ret; +} + +int btrfs_del_uuid_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id) +{ + return btrfs_uuid_tree_rem(trans, uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_SUBVOL, subvol_id); +} + +int btrfs_lookup_uuid_received_subvol_item(struct btrfs_root *uuid_root, + u8 *uuid, u64 *subvol_id) +{ + return btrfs_uuid_tree_lookup_any(uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, + subvol_id); +} + +int btrfs_insert_uuid_received_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, + u8 *uuid, u64 subvol_id) +{ + int ret; + + ret = btrfs_uuid_tree_lookup(uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, + subvol_id); + if (ret == -ENOENT) + ret = btrfs_uuid_tree_add(trans, uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, + subvol_id); + return ret; +} + +int btrfs_del_uuid_received_subvol_item(struct btrfs_trans_handle *trans, + struct btrfs_root *uuid_root, u8 *uuid, + u64 subvol_id) +{ + return btrfs_uuid_tree_rem(trans, uuid_root, uuid, + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, + subvol_id); +} -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 09:36 UTC
[PATCH v2 2/8] Btrfs: support printing UUID tree elements
This commit adds support to print UUID tree elements to print-tree.c. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/print-tree.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c index dc0024f..121cf64 100644 --- a/fs/btrfs/print-tree.c +++ b/fs/btrfs/print-tree.c @@ -155,6 +155,82 @@ static void print_extent_ref_v0(struct extent_buffer *eb, int slot) } #endif +static void print_uuid_item(struct extent_buffer *l, + struct btrfs_uuid_item *ptr, + u32 item_size) +{ + do { + u64 sub_item_type; + u64 sub_item_len; + u64 subvol_id; + unsigned long offset; + + if (item_size < sizeof(*ptr)) { + printk(KERN_INFO + "btrfs: uuid item too short (%llu < %d)!\n", + (unsigned long long)item_size, + (int)sizeof(*ptr)); + return; + } + sub_item_type = btrfs_uuid_type(l, ptr); + sub_item_len = btrfs_uuid_len(l, ptr); + ptr++; + item_size -= sizeof(*ptr); + + if (sub_item_len * sizeof(u64) > item_size) { + printk(KERN_INFO + "btrfs: uuid item too short (%llu > %llu)!\n", + (unsigned long long)(sub_item_len * 8), + (unsigned long long)item_size); + return; + } + + offset = (unsigned long)ptr; + ptr = (struct btrfs_uuid_item *) + (((char *)ptr) + sub_item_len * sizeof(u64)); + item_size -= sub_item_len * sizeof(u64); + switch (sub_item_type) { + case BTRFS_UUID_ITEM_TYPE_SUBVOL: + while (sub_item_len) { + read_extent_buffer(l, &subvol_id, offset, + sizeof(u64)); + printk(KERN_INFO "\t\tsubvol_id %llu\n", + (unsigned long long) + le64_to_cpu(subvol_id)); + sub_item_len--; + offset += sizeof(u64); + } + break; + case BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL: + while (sub_item_len) { + read_extent_buffer(l, &subvol_id, offset, + sizeof(u64)); + printk(KERN_INFO + "\t\treceived_subvol_id %llu\n", + (unsigned long long) + le64_to_cpu(subvol_id)); + sub_item_len--; + offset += sizeof(u64); + } + break; + default: + printk(KERN_INFO "\t\tunknown type=%llu, len=8*%llu\n", + (unsigned long long)sub_item_type, + (unsigned long long)sub_item_len); + while (sub_item_len) { + read_extent_buffer(l, &subvol_id, offset, + sizeof(u64)); + printk(KERN_INFO "\t\tid %llu\n", + (unsigned long long) + le64_to_cpu(subvol_id)); + sub_item_len--; + offset += sizeof(u64); + } + break; + } + } while (item_size); +} + void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l) { int i; @@ -168,6 +244,7 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l) struct btrfs_extent_data_ref *dref; struct btrfs_shared_data_ref *sref; struct btrfs_dev_extent *dev_extent; + struct btrfs_uuid_item *uuid_item; struct btrfs_key key; struct btrfs_key found_key; @@ -301,6 +378,11 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l) case BTRFS_DEV_REPLACE_KEY: printk(KERN_INFO "\t\tdev replace\n"); break; + case BTRFS_UUID_KEY: + uuid_item = btrfs_item_ptr(l, i, + struct btrfs_uuid_item); + print_uuid_item(l, uuid_item, btrfs_item_size_nr(l, i)); + break; }; } } -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 40 +++++++++++++++++++++++++++++++++++++++- fs/btrfs/volumes.c | 27 +++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + 4 files changed, 68 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ca9149a..6f9d760 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1305,6 +1305,7 @@ struct btrfs_fs_info { struct btrfs_root *fs_root; struct btrfs_root *csum_root; struct btrfs_root *quota_root; + struct btrfs_root *uuid_root; /* the log root tree is a directory of all the other log roots */ struct btrfs_root *log_root_tree; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 4904d7f..3c2bf74 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1499,6 +1499,9 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info, if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID) return fs_info->quota_root ? fs_info->quota_root : ERR_PTR(-ENOENT); + if (location->objectid == BTRFS_UUID_TREE_OBJECTID) + return fs_info->uuid_root ? fs_info->uuid_root : + ERR_PTR(-ENOENT); again: spin_lock(&fs_info->fs_roots_radix_lock); root = radix_tree_lookup(&fs_info->fs_roots_radix, @@ -1955,6 +1958,10 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) free_extent_buffer(info->quota_root->node); free_extent_buffer(info->quota_root->commit_root); } + if (info->uuid_root) { + free_extent_buffer(info->uuid_root->node); + free_extent_buffer(info->uuid_root->commit_root); + } info->tree_root->node = NULL; info->tree_root->commit_root = NULL; @@ -1968,6 +1975,10 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) info->quota_root->node = NULL; info->quota_root->commit_root = NULL; } + if (info->uuid_root) { + info->uuid_root->node = NULL; + info->uuid_root->commit_root = NULL; + } if (chunk_root) { free_extent_buffer(info->chunk_root->node); @@ -2029,11 +2040,13 @@ int open_ctree(struct super_block *sb, struct btrfs_root *chunk_root; struct btrfs_root *dev_root; struct btrfs_root *quota_root; + struct btrfs_root *uuid_root; struct btrfs_root *log_tree_root; int ret; int err = -EINVAL; int num_backups_tried = 0; int backup_index = 0; + bool create_uuid_tree = false; tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info); extent_root = fs_info->extent_root = btrfs_alloc_root(fs_info); @@ -2041,9 +2054,10 @@ int open_ctree(struct super_block *sb, chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info); dev_root = fs_info->dev_root = btrfs_alloc_root(fs_info); quota_root = fs_info->quota_root = btrfs_alloc_root(fs_info); + uuid_root = fs_info->uuid_root = btrfs_alloc_root(fs_info); if (!tree_root || !extent_root || !csum_root || - !chunk_root || !dev_root || !quota_root) { + !chunk_root || !dev_root || !quota_root || !uuid_root) { err = -ENOMEM; goto fail; } @@ -2604,6 +2618,19 @@ retry_root_backup: fs_info->pending_quota_state = 1; } + ret = find_and_setup_root(tree_root, fs_info, + BTRFS_UUID_TREE_OBJECTID, uuid_root); + if (ret) { + if (ret != -ENOENT) + goto recovery_tree_root; + + kfree(uuid_root); + uuid_root = fs_info->uuid_root = NULL; + create_uuid_tree = true; + } else { + uuid_root->track_dirty = 1; + } + fs_info->generation = generation; fs_info->last_trans_committed = generation; @@ -2788,6 +2815,17 @@ retry_root_backup: return ret; } + if (create_uuid_tree) { + pr_info("btrfs: creating UUID tree\n"); + ret = btrfs_create_uuid_tree(fs_info); + if (ret) { + pr_warn("btrfs: failed to create the UUID tree %d\n", + ret); + close_ctree(tree_root); + return ret; + } + } + return 0; fail_qgroup: diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a191bac..e6f5b88 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3427,6 +3427,33 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) return 0; } +int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) +{ + struct btrfs_trans_handle *trans; + struct btrfs_root *tree_root = fs_info->tree_root; + struct btrfs_root *uuid_root; + + /* + * 1 - root node + * 1 - root item + */ + trans = btrfs_start_transaction(tree_root, 2); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + uuid_root = btrfs_create_tree(trans, fs_info, + BTRFS_UUID_TREE_OBJECTID); + if (IS_ERR(uuid_root)) { + btrfs_abort_transaction(trans, tree_root, + PTR_ERR(uuid_root)); + return PTR_ERR(uuid_root); + } + + fs_info->uuid_root = uuid_root; + uuid_root->track_dirty = 1; + + return btrfs_commit_transaction(trans, tree_root); +} /* * shrinking a device means finding all of the device extents past * the new size, and then following the back refs to the chunks. diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 845ccbb..3479532 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -295,6 +295,7 @@ int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info); int btrfs_recover_balance(struct btrfs_fs_info *fs_info); int btrfs_pause_balance(struct btrfs_fs_info *fs_info); int btrfs_cancel_balance(struct btrfs_fs_info *fs_info); +int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info); int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes, u64 *start, u64 *max_avail); -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 09:36 UTC
[PATCH v2 4/8] Btrfs: maintain subvolume items in the UUID tree
When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume. User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 1 + fs/btrfs/ioctl.c | 74 +++++++++++++++++++++++++++++++++++++++++++------- fs/btrfs/transaction.c | 19 ++++++++++++- 3 files changed, 83 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 6f9d760..73d797c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3634,6 +3634,7 @@ extern const struct dentry_operations btrfs_dentry_operations; long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg); void btrfs_update_iflags(struct inode *inode); void btrfs_inherit_iflags(struct inode *inode, struct inode *dir); +int btrfs_is_empty_uuid(u8 *uuid); int btrfs_defrag_file(struct inode *inode, struct file *file, struct btrfs_ioctl_defrag_range_args *range, u64 newer_than, unsigned long max_pages); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b2537c7..47bef5a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -363,6 +363,13 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) return 0; } +int btrfs_is_empty_uuid(u8 *uuid) +{ + static char empty_uuid[BTRFS_UUID_SIZE] = {0}; + + return !memcmp(uuid, empty_uuid, BTRFS_UUID_SIZE); +} + static noinline int create_subvol(struct inode *dir, struct dentry *dentry, char *name, int namelen, @@ -396,7 +403,7 @@ static noinline int create_subvol(struct inode *dir, * of create_snapshot(). */ ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, - 7, &qgroup_reserved); + 8, &qgroup_reserved); if (ret) return ret; @@ -518,9 +525,13 @@ static noinline int create_subvol(struct inode *dir, ret = btrfs_add_root_ref(trans, root->fs_info->tree_root, objectid, root->root_key.objectid, btrfs_ino(dir), index, name, namelen); - BUG_ON(ret); + ret = btrfs_insert_uuid_subvol_item(trans, root->fs_info->uuid_root, + root_item.uuid, objectid); + if (ret) + btrfs_abort_transaction(trans, root, ret); + fail: trans->block_rsv = NULL; trans->bytes_reserved = 0; @@ -567,9 +578,10 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, * 1 - root item * 2 - root ref/backref * 1 - root of snapshot + * 1 - UUID item */ ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root, - &pending_snapshot->block_rsv, 7, + &pending_snapshot->block_rsv, 8, &pending_snapshot->qgroup_reserved); if (ret) goto out; @@ -2207,6 +2219,27 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_end_trans; } } + + ret = btrfs_del_uuid_subvol_item(trans, root->fs_info->uuid_root, + dest->root_item.uuid, + dest->root_key.objectid); + if (ret && ret != -ENOENT) { + btrfs_abort_transaction(trans, root, ret); + err = ret; + goto out_end_trans; + } + if (!btrfs_is_empty_uuid(dest->root_item.received_uuid)) { + ret = btrfs_del_uuid_received_subvol_item(trans, + root->fs_info->uuid_root, + dest->root_item.received_uuid, + dest->root_key.objectid); + if (ret && ret != -ENOENT) { + btrfs_abort_transaction(trans, root, ret); + err = ret; + goto out_end_trans; + } + } + out_end_trans: trans->block_rsv = NULL; trans->bytes_reserved = 0; @@ -2418,7 +2451,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg) struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices; int ret = 0; char *s_uuid = NULL; - char empty_uuid[BTRFS_UUID_SIZE] = {0}; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -2427,7 +2459,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg) if (IS_ERR(di_args)) return PTR_ERR(di_args); - if (memcmp(empty_uuid, di_args->uuid, BTRFS_UUID_SIZE) != 0) + if (!btrfs_is_empty_uuid(di_args->uuid)) s_uuid = di_args->uuid; mutex_lock(&fs_devices->device_list_mutex); @@ -3952,6 +3984,7 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, struct btrfs_trans_handle *trans; struct timespec ct = CURRENT_TIME; int ret = 0; + int received_uuid_changed; ret = mnt_want_write_file(file); if (ret < 0) @@ -3981,7 +4014,11 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, goto out; } - trans = btrfs_start_transaction(root, 1); + /* + * 1 - root item + * 2 - uuid items (received uuid + subvol uuid) + */ + trans = btrfs_start_transaction(root, 3); if (IS_ERR(trans)) { ret = PTR_ERR(trans); trans = NULL; @@ -3992,6 +4029,14 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, sa->rtime.sec = ct.tv_sec; sa->rtime.nsec = ct.tv_nsec; + received_uuid_changed = memcmp(root_item->received_uuid, sa->uuid, + BTRFS_UUID_SIZE); + if (received_uuid_changed && + !btrfs_is_empty_uuid(root_item->received_uuid)) + btrfs_del_uuid_received_subvol_item(trans, + root->fs_info->uuid_root, + root_item->received_uuid, + root->root_key.objectid); memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE); btrfs_set_root_stransid(root_item, sa->stransid); btrfs_set_root_rtransid(root_item, sa->rtransid); @@ -4004,12 +4049,21 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, &root->root_key, &root->root_item); if (ret < 0) { btrfs_end_transaction(trans, root); - trans = NULL; goto out; - } else { - ret = btrfs_commit_transaction(trans, root); - if (ret < 0) + } + if (received_uuid_changed && !btrfs_is_empty_uuid(sa->uuid)) { + ret = btrfs_insert_uuid_received_subvol_item( + trans, root->fs_info->uuid_root, sa->uuid, + root->root_key.objectid); + if (ret < 0 && ret != -EEXIST) { + btrfs_abort_transaction(trans, root, ret); goto out; + } + } + ret = btrfs_commit_transaction(trans, root); + if (ret < 0) { + btrfs_abort_transaction(trans, root, ret); + goto out; } ret = copy_to_user(arg, sa, sizeof(*sa)); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 0544587..a3d92b3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1263,8 +1263,25 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, dentry->d_name.len * 2); parent_inode->i_mtime = parent_inode->i_ctime = CURRENT_TIME; ret = btrfs_update_inode_fallback(trans, parent_root, parent_inode); - if (ret) + if (ret) { + btrfs_abort_transaction(trans, root, ret); + goto fail; + } + ret = btrfs_insert_uuid_subvol_item(trans, fs_info->uuid_root, + new_uuid.b, objectid); + if (ret) { btrfs_abort_transaction(trans, root, ret); + goto fail; + } + if (!btrfs_is_empty_uuid(new_root_item->received_uuid)) { + ret = btrfs_insert_uuid_received_subvol_item( + trans, fs_info->uuid_root, new_root_item->received_uuid, + objectid); + if (ret && ret != -EEXIST) { + btrfs_abort_transaction(trans, root, ret); + goto fail; + } + } fail: pending->error = ret; dir_item_existed: -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 3 ++ fs/btrfs/disk-io.c | 5 ++ fs/btrfs/volumes.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 157 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 73d797c..ee347d6 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -23,6 +23,7 @@ #include <linux/highmem.h> #include <linux/fs.h> #include <linux/rwsem.h> +#include <linux/semaphore.h> #include <linux/completion.h> #include <linux/backing-dev.h> #include <linux/wait.h> @@ -1649,6 +1650,8 @@ struct btrfs_fs_info { struct btrfs_dev_replace dev_replace; atomic_t mutually_exclusive_operation_running; + + struct semaphore uuid_scan_sem; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 3c2bf74..e99514e 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -31,6 +31,7 @@ #include <linux/migrate.h> #include <linux/ratelimit.h> #include <linux/uuid.h> +#include <linux/semaphore.h> #include <asm/unaligned.h> #include "compat.h" #include "ctree.h" @@ -2230,6 +2231,7 @@ int open_ctree(struct super_block *sb, init_rwsem(&fs_info->extent_commit_sem); init_rwsem(&fs_info->cleanup_work_sem); init_rwsem(&fs_info->subvol_sem); + sema_init(&fs_info->uuid_scan_sem, 1); fs_info->dev_replace.lock_owner = 0; atomic_set(&fs_info->dev_replace.nesting_level, 0); mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount); @@ -3447,6 +3449,9 @@ int close_ctree(struct btrfs_root *root) fs_info->closing = 1; smp_mb(); + /* wait for the uuid_scan task to finish */ + down(&fs_info->uuid_scan_sem); + /* pause restriper - we want to resume on mount */ btrfs_pause_balance(fs_info); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e6f5b88..d4fc33b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -26,6 +26,7 @@ #include <linux/ratelimit.h> #include <linux/kthread.h> #include <linux/raid/pq.h> +#include <linux/semaphore.h> #include <asm/div64.h> #include "compat.h" #include "ctree.h" @@ -3427,11 +3428,145 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) return 0; } +static int btrfs_uuid_scan_kthread(void *data) +{ + struct btrfs_fs_info *fs_info = data; + struct btrfs_root *root = fs_info->tree_root; + struct btrfs_key key; + struct btrfs_key max_key; + struct btrfs_path *path = NULL; + int ret = 0; + struct extent_buffer *eb; + int slot; + struct btrfs_root_item root_item; + u32 item_size; + struct btrfs_trans_handle *trans; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + key.objectid = 0; + key.type = BTRFS_ROOT_ITEM_KEY; + key.offset = 0; + + max_key.objectid = (u64)-1; + max_key.type = BTRFS_ROOT_ITEM_KEY; + max_key.offset = (u64)-1; + + path->keep_locks = 1; + + while (1) { + ret = btrfs_search_forward(root, &key, &max_key, path, 0); + if (ret) { + if (ret > 0) + ret = 0; + break; + } + + if (key.type != BTRFS_ROOT_ITEM_KEY || + (key.objectid < BTRFS_FIRST_FREE_OBJECTID && + key.objectid != BTRFS_FS_TREE_OBJECTID) || + key.objectid > BTRFS_LAST_FREE_OBJECTID) + goto skip; + + eb = path->nodes[0]; + slot = path->slots[0]; + item_size = btrfs_item_size_nr(eb, slot); + if (item_size < sizeof(root_item)) + goto skip; + + trans = NULL; + read_extent_buffer(eb, &root_item, + btrfs_item_ptr_offset(eb, slot), + (int)sizeof(root_item)); + if (btrfs_root_refs(&root_item) == 0) + goto skip; + if (!btrfs_is_empty_uuid(root_item.uuid)) { + /* + * 1 - subvol uuid item + * 1 - received_subvol uuid item + */ + trans = btrfs_start_transaction(fs_info->uuid_root, 2); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + break; + } + ret = btrfs_insert_uuid_subvol_item(trans, + fs_info->uuid_root, + root_item.uuid, + key.objectid); + if (ret < 0) { + pr_warn("btrfs: insert_uuid_received_subvol_item failed %d\n", + ret); + btrfs_end_transaction(trans, + fs_info->uuid_root); + break; + } + } + + if (!btrfs_is_empty_uuid(root_item.received_uuid)) { + if (!trans) { + /* 1 - received_subvol uuid item */ + trans = btrfs_start_transaction( + fs_info->uuid_root, 1); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + break; + } + } + ret = btrfs_insert_uuid_received_subvol_item( + trans, fs_info->uuid_root, + root_item.received_uuid, key.objectid); + if (ret < 0) { + pr_warn("btrfs: insert_uuid_received_subvol_item failed %d\n", + ret); + btrfs_end_transaction(trans, + fs_info->uuid_root); + break; + } + } + + if (trans) { + ret = btrfs_end_transaction(trans, fs_info->uuid_root); + if (ret) + break; + } + +skip: + btrfs_release_path(path); + if (key.offset < (u64)-1) { + key.offset++; + } else if (key.type < BTRFS_ROOT_ITEM_KEY) { + key.offset = 0; + key.type = BTRFS_ROOT_ITEM_KEY; + } else if (key.objectid < (u64)-1) { + key.offset = 0; + key.type = BTRFS_ROOT_ITEM_KEY; + key.objectid++; + } else { + break; + } + cond_resched(); + } + +out: + btrfs_free_path(path); + if (ret) + pr_warn("btrfs: btrfs_uuid_scan_kthread failed %d\n", ret); + up(&fs_info->uuid_scan_sem); + return 0; +} + int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) { struct btrfs_trans_handle *trans; struct btrfs_root *tree_root = fs_info->tree_root; struct btrfs_root *uuid_root; + struct task_struct *task; + int ret; /* * 1 - root node @@ -3452,8 +3587,21 @@ int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) fs_info->uuid_root = uuid_root; uuid_root->track_dirty = 1; - return btrfs_commit_transaction(trans, tree_root); + ret = btrfs_commit_transaction(trans, tree_root); + if (ret) + return ret; + + down(&fs_info->uuid_scan_sem); + task = kthread_run(btrfs_uuid_scan_kthread, fs_info, "btrfs-uuid"); + if (IS_ERR(task)) { + pr_warn("btrfs: failed to start uuid_scan task\n"); + up(&fs_info->uuid_scan_sem); + return PTR_ERR(task); + } + + return 0; } + /* * shrinking a device means finding all of the device extents past * the new size, and then following the back refs to the chunks. -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
In order to be able to detect the case that a filesystem is mounted with an old kernel, add a uuid-tree-gen field like the free space cache is doing it. It is part of the super block and written with each commit. Old kernels do not know this field and don''t update it. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 5 ++++- fs/btrfs/transaction.c | 1 + 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ee347d6..63fd4ff 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -482,9 +482,10 @@ struct btrfs_super_block { char label[BTRFS_LABEL_SIZE]; __le64 cache_generation; + __le64 uuid_tree_generation; /* future expansion */ - __le64 reserved[31]; + __le64 reserved[30]; u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE]; struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS]; } __attribute__ ((__packed__)); @@ -2827,6 +2828,8 @@ BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block, csum_type, 16); BTRFS_SETGET_STACK_FUNCS(super_cache_generation, struct btrfs_super_block, cache_generation, 64); +BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block, + uuid_tree_generation, 64); static inline int btrfs_super_csum_size(struct btrfs_super_block *s) { diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a3d92b3..eed52bb 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1331,6 +1331,7 @@ static void update_super_roots(struct btrfs_root *root) super->root_level = root_item->level; if (btrfs_test_opt(root, SPACE_CACHE)) super->cache_generation = root_item->generation; + super->uuid_tree_generation = root_item->generation; } int btrfs_transaction_in_commit(struct btrfs_fs_info *info) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 09:36 UTC
[PATCH v2 7/8] Btrfs: check UUID tree during mount if required
If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things: 1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed). 2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there). This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 4 ++ fs/btrfs/disk-io.c | 18 +++++- fs/btrfs/transaction.c | 3 +- fs/btrfs/uuid-tree.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.c | 83 ++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + 6 files changed, 263 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 63fd4ff..bdfa18e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1653,6 +1653,7 @@ struct btrfs_fs_info { atomic_t mutually_exclusive_operation_running; struct semaphore uuid_scan_sem; + unsigned int update_uuid_tree_gen:1; }; /* @@ -3412,6 +3413,9 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root); /* uuid-tree.c */ +int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info, + int (*check_func)(struct btrfs_fs_info *, u8 *, u64, + u64)); int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, u64 *subvol_id); int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index e99514e..30513c1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2047,7 +2047,8 @@ int open_ctree(struct super_block *sb, int err = -EINVAL; int num_backups_tried = 0; int backup_index = 0; - bool create_uuid_tree = false; + bool create_uuid_tree; + bool check_uuid_tree; tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info); extent_root = fs_info->extent_root = btrfs_alloc_root(fs_info); @@ -2629,8 +2630,12 @@ retry_root_backup: kfree(uuid_root); uuid_root = fs_info->uuid_root = NULL; create_uuid_tree = true; + check_uuid_tree = false; } else { uuid_root->track_dirty = 1; + create_uuid_tree = false; + check_uuid_tree + generation != btrfs_super_uuid_tree_generation(disk_super); } fs_info->generation = generation; @@ -2826,6 +2831,17 @@ retry_root_backup: close_ctree(tree_root); return ret; } + } else if (check_uuid_tree) { + pr_info("btrfs: checking UUID tree\n"); + ret = btrfs_check_uuid_tree(fs_info); + if (ret) { + pr_warn("btrfs: failed to check the UUID tree %d\n", + ret); + close_ctree(tree_root); + return ret; + } + } else { + fs_info->update_uuid_tree_gen = 1; } return 0; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index eed52bb..0c3f0d3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1331,7 +1331,8 @@ static void update_super_roots(struct btrfs_root *root) super->root_level = root_item->level; if (btrfs_test_opt(root, SPACE_CACHE)) super->cache_generation = root_item->generation; - super->uuid_tree_generation = root_item->generation; + if (root->fs_info->update_uuid_tree_gen) + super->uuid_tree_generation = root_item->generation; } int btrfs_transaction_in_commit(struct btrfs_fs_info *info) diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c index a44e13f..e5ab9ae 100644 --- a/fs/btrfs/uuid-tree.c +++ b/fs/btrfs/uuid-tree.c @@ -416,6 +416,162 @@ out: return ret; } +static int btrfs_uuid_iter_rem(struct btrfs_root *uuid_root, u8 *uuid, + u64 sub_item_type, u64 subid) +{ + struct btrfs_trans_handle *trans; + int ret; + + /* 1 - for the uuid item */ + trans = btrfs_start_transaction(uuid_root, 1); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto out; + } + + ret = btrfs_uuid_tree_rem(trans, uuid_root, uuid, + sub_item_type, subid); + if (ret && ret != -ENOENT) + btrfs_end_transaction(trans, uuid_root); + else + ret = btrfs_end_transaction(trans, uuid_root); + +out: + return ret; +} + +int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info, + int (*check_func)(struct btrfs_fs_info *, u8 *, u64, + u64)) +{ + struct btrfs_root *root = fs_info->uuid_root; + struct btrfs_key key; + struct btrfs_key max_key; + struct btrfs_path *path = NULL; + int ret = 0; + struct extent_buffer *eb; + int slot; + u32 item_size; + struct btrfs_uuid_item *ptr; + u64 subid; + unsigned long offset; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + key.objectid = 0; + key.type = BTRFS_UUID_KEY; + key.offset = 0; + + max_key.objectid = (u64)-1; + max_key.type = BTRFS_UUID_KEY; + max_key.offset = (u64)-1; + + path->keep_locks = 1; + + while (1) { +again: + cond_resched(); + ret = btrfs_search_forward(root, &key, &max_key, path, 0); + if (ret) { + if (ret > 0) + ret = 0; + break; + } + + if (key.type != BTRFS_UUID_KEY) + goto skip; + + eb = path->nodes[0]; + slot = path->slots[0]; + ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item); + item_size = btrfs_item_size_nr(eb, slot); + do { + u64 sub_item_type; + u64 sub_item_len; + + if (item_size < sizeof(*ptr)) { + pr_warn("btrfs: uuid item too short (%llu < %d)!\n", + (unsigned long long)item_size, + (int)sizeof(*ptr)); + goto skip; /* is this an old kernel? */ + } + sub_item_type = btrfs_uuid_type(eb, ptr); + sub_item_len = btrfs_uuid_len(eb, ptr); + ptr++; + item_size -= sizeof(*ptr); + if (sub_item_len * sizeof(u64) > item_size) { + pr_warn("btrfs: uuid item too short (%llu > %llu)!\n", + (unsigned long long)(sub_item_len * + sizeof(u64)), + (unsigned long long)item_size); + goto skip; + } + offset = (unsigned long)ptr; + ptr = (struct btrfs_uuid_item *) + (((char *)ptr) + sub_item_len * sizeof(u64)); + item_size -= sub_item_len * sizeof(u64); + while (sub_item_len) { + u8 uuid[BTRFS_UUID_SIZE]; + + put_unaligned_le64(key.objectid, uuid); + put_unaligned_le64(key.offset, + uuid + sizeof(u64)); + read_extent_buffer(eb, &subid, offset, + sizeof(subid)); + subid = le64_to_cpu(subid); + ret = check_func(fs_info, uuid, + sub_item_type, subid); + if (ret < 0) + goto out; + if (ret > 0) { + btrfs_release_path(path); + ret = btrfs_uuid_iter_rem( + fs_info->uuid_root, uuid, + sub_item_type, subid); + if (ret < 0) + goto out; + /* + * this might look inefficient, but + * the justification is that it is an + * exception that check_func returns 1, + * and that in the regular case only + * one or two entries per UUID exist. + */ + goto again; + } + sub_item_len--; + offset += sizeof(u64); + } + } while (item_size); + +skip: + btrfs_release_path(path); + if (key.offset < (u64)-1) { + key.offset++; + } else if (key.type < BTRFS_UUID_KEY) { + key.offset = 0; + key.type = BTRFS_UUID_KEY; + } else if (key.objectid < (u64)-1) { + key.offset = 0; + key.type = 0; + key.objectid++; + } else { + break; + } + } + +out: + btrfs_free_path(path); + if (ret) + pr_warn("btrfs: btrfs_uuid_tree_iterate failed %d\n", ret); + up(&fs_info->uuid_scan_sem); + return 0; +} + int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, u64 *subvol_id) { diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d4fc33b..6e89030 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3556,10 +3556,76 @@ out: btrfs_free_path(path); if (ret) pr_warn("btrfs: btrfs_uuid_scan_kthread failed %d\n", ret); + else + fs_info->update_uuid_tree_gen = 1; up(&fs_info->uuid_scan_sem); return 0; } +/* + * Callback for btrfs_uuid_tree_iterate(). + * returns: + * 0 check succeeded, the entry is not outdated. + * < 0 if an error occured. + * > 0 if the check failed, which means the caller shall remove the entry. + */ +static int btrfs_check_uuid_tree_entry(struct btrfs_fs_info *fs_info, + u8 *uuid, u64 type, u64 subid) +{ + struct btrfs_key key; + int ret = 0; + struct btrfs_root *subvol_root; + + if (type != BTRFS_UUID_ITEM_TYPE_SUBVOL && + type != BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL) + goto out; + + key.objectid = subid; + key.type = BTRFS_ROOT_ITEM_KEY; + key.offset = (u64)-1; + subvol_root = btrfs_read_fs_root_no_name(fs_info, &key); + if (IS_ERR(subvol_root)) { + ret = PTR_ERR(subvol_root); + if (ret == -ENOENT) + ret = 1; + goto out; + } + + switch (type) { + case BTRFS_UUID_ITEM_TYPE_SUBVOL: + if (memcmp(uuid, subvol_root->root_item.uuid, BTRFS_UUID_SIZE)) + ret = 1; + break; + case BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL: + if (memcmp(uuid, subvol_root->root_item.received_uuid, + BTRFS_UUID_SIZE)) + ret = 1; + break; + } + +out: + return ret; +} + +static int btrfs_uuid_rescan_kthread(void *data) +{ + struct btrfs_fs_info *fs_info = (struct btrfs_fs_info *)data; + int ret; + + /* + * 1st step is to iterate through the existing UUID tree and + * to delete all entries that contain outdated data. + * 2nd step is to add all missing entries to the UUID tree. + */ + ret = btrfs_uuid_tree_iterate(fs_info, btrfs_check_uuid_tree_entry); + if (ret < 0) { + pr_warn("btrfs: iterating uuid_tree failed %d\n", ret); + up(&fs_info->uuid_scan_sem); + return ret; + } + return btrfs_uuid_scan_kthread(data); +} + int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) { struct btrfs_trans_handle *trans; @@ -3594,6 +3660,7 @@ int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) down(&fs_info->uuid_scan_sem); task = kthread_run(btrfs_uuid_scan_kthread, fs_info, "btrfs-uuid"); if (IS_ERR(task)) { + /* fs_info->update_uuid_tree_gen remains 0 in all error case */ pr_warn("btrfs: failed to start uuid_scan task\n"); up(&fs_info->uuid_scan_sem); return PTR_ERR(task); @@ -3602,6 +3669,22 @@ int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) return 0; } +int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info) +{ + struct task_struct *task; + + down(&fs_info->uuid_scan_sem); + task = kthread_run(btrfs_uuid_rescan_kthread, fs_info, "btrfs-uuid"); + if (IS_ERR(task)) { + /* fs_info->update_uuid_tree_gen remains 0 in all error case */ + pr_warn("btrfs: failed to start uuid_rescan task\n"); + up(&fs_info->uuid_scan_sem); + return PTR_ERR(task); + } + + return 0; +} + /* * shrinking a device means finding all of the device extents past * the new size, and then following the back refs to the chunks. diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 3479532..b0ce9a1 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -296,6 +296,7 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info); int btrfs_pause_balance(struct btrfs_fs_info *fs_info); int btrfs_cancel_balance(struct btrfs_fs_info *fs_info); int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info); +int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info); int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes, u64 *start, u64 *max_avail); -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 09:37 UTC
[PATCH v2 8/8] Btrfs: add mount option to force UUID tree checking
This should never be needed, but since all functions are there to check and rebuild the UUID tree, a mount option is added that allows to force this check and rebuild procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> --- fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 3 ++- fs/btrfs/super.c | 8 +++++++- 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index bdfa18e..fd1d2fb 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1966,6 +1966,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) +#define BTRFS_MOUNT_RESCAN_UUID_TREE (1 << 23) #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 30513c1..2e94b1a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2831,7 +2831,8 @@ retry_root_backup: close_ctree(tree_root); return ret; } - } else if (check_uuid_tree) { + } else if (check_uuid_tree || + btrfs_test_opt(tree_root, RESCAN_UUID_TREE)) { pr_info("btrfs: checking UUID tree\n"); ret = btrfs_check_uuid_tree(fs_info); if (ret) { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index b6af7d3..4602c6a 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -317,7 +317,7 @@ enum { Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_check_integrity, Opt_check_integrity_including_extent_data, - Opt_check_integrity_print_mask, Opt_fatal_errors, + Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree, Opt_err, }; @@ -357,6 +357,7 @@ static match_table_t tokens = { {Opt_check_integrity, "check_int"}, {Opt_check_integrity_including_extent_data, "check_int_data"}, {Opt_check_integrity_print_mask, "check_int_print_mask=%d"}, + {Opt_rescan_uuid_tree, "rescan_uuid_tree"}, {Opt_fatal_errors, "fatal_errors=%s"}, {Opt_err, NULL}, }; @@ -551,6 +552,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_space_cache: btrfs_set_opt(info->mount_opt, SPACE_CACHE); break; + case Opt_rescan_uuid_tree: + btrfs_set_opt(info->mount_opt, RESCAN_UUID_TREE); + break; case Opt_no_space_cache: printk(KERN_INFO "btrfs: disabling disk space caching\n"); btrfs_clear_opt(info->mount_opt, SPACE_CACHE); @@ -925,6 +929,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) seq_puts(seq, ",space_cache"); else seq_puts(seq, ",nospace_cache"); + if (btrfs_test_opt(root, RESCAN_UUID_TREE)) + seq_puts(seq, ",rescan_uuid_tree"); if (btrfs_test_opt(root, CLEAR_CACHE)) seq_puts(seq, ",clear_cache"); if (btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED)) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2013-May-14 10:44 UTC
Re: [PATCH v2 4/8] Btrfs: maintain subvolume items in the UUID tree
On Tue, May 14, 2013 at 11:36:56AM +0200, Stefan Behrens wrote:> When a new subvolume or snapshot is created, a new UUID item is added > to the UUID tree. Such items are removed when the subvolume is deleted. > The ioctl to set the received subvolume UUID is also touched and will > now also add this received UUID into the UUID tree together with the > ID of the subvolume. The latter is also done when read-only snapshots > are created which inherit all the send/receive information from the > parent subvolume. > > User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and > read in the UUID tree. > > Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> > --- > fs/btrfs/ctree.h | 1 + > fs/btrfs/ioctl.c | 74 +++++++++++++++++++++++++++++++++++++++++++------- > fs/btrfs/transaction.c | 19 ++++++++++++- > 3 files changed, 83 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 6f9d760..73d797c 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -3634,6 +3634,7 @@ extern const struct dentry_operations btrfs_dentry_operations; > long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg); > void btrfs_update_iflags(struct inode *inode); > void btrfs_inherit_iflags(struct inode *inode, struct inode *dir); > +int btrfs_is_empty_uuid(u8 *uuid); > int btrfs_defrag_file(struct inode *inode, struct file *file, > struct btrfs_ioctl_defrag_range_args *range, > u64 newer_than, unsigned long max_pages); > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index b2537c7..47bef5a 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -363,6 +363,13 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) > return 0; > } > > +int btrfs_is_empty_uuid(u8 *uuid) > +{ > + static char empty_uuid[BTRFS_UUID_SIZE] = {0}; > + > + return !memcmp(uuid, empty_uuid, BTRFS_UUID_SIZE); > +} > + > static noinline int create_subvol(struct inode *dir, > struct dentry *dentry, > char *name, int namelen, > @@ -396,7 +403,7 @@ static noinline int create_subvol(struct inode *dir, > * of create_snapshot(). > */ > ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, > - 7, &qgroup_reserved); > + 8, &qgroup_reserved); > if (ret) > return ret; > > @@ -518,9 +525,13 @@ static noinline int create_subvol(struct inode *dir, > ret = btrfs_add_root_ref(trans, root->fs_info->tree_root, > objectid, root->root_key.objectid, > btrfs_ino(dir), index, name, namelen); > - > BUG_ON(ret); >This uuid_root will not use trans->block_rsv but empty_rsv since it does not set ref_cow, so you don''t need to add one more to block_rsv, and same for the below cases. thanks, liubo> + ret = btrfs_insert_uuid_subvol_item(trans, root->fs_info->uuid_root, > + root_item.uuid, objectid); > + if (ret) > + btrfs_abort_transaction(trans, root, ret); > + > fail: > trans->block_rsv = NULL; > trans->bytes_reserved = 0; > @@ -567,9 +578,10 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, > * 1 - root item > * 2 - root ref/backref > * 1 - root of snapshot > + * 1 - UUID item > */ > ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root, > - &pending_snapshot->block_rsv, 7, > + &pending_snapshot->block_rsv, 8, > &pending_snapshot->qgroup_reserved); > if (ret) > goto out; > @@ -2207,6 +2219,27 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, > goto out_end_trans; > } > } > + > + ret = btrfs_del_uuid_subvol_item(trans, root->fs_info->uuid_root, > + dest->root_item.uuid, > + dest->root_key.objectid); > + if (ret && ret != -ENOENT) { > + btrfs_abort_transaction(trans, root, ret); > + err = ret; > + goto out_end_trans; > + } > + if (!btrfs_is_empty_uuid(dest->root_item.received_uuid)) { > + ret = btrfs_del_uuid_received_subvol_item(trans, > + root->fs_info->uuid_root, > + dest->root_item.received_uuid, > + dest->root_key.objectid); > + if (ret && ret != -ENOENT) { > + btrfs_abort_transaction(trans, root, ret); > + err = ret; > + goto out_end_trans; > + } > + } > + > out_end_trans: > trans->block_rsv = NULL; > trans->bytes_reserved = 0; > @@ -2418,7 +2451,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg) > struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices; > int ret = 0; > char *s_uuid = NULL; > - char empty_uuid[BTRFS_UUID_SIZE] = {0}; > > if (!capable(CAP_SYS_ADMIN)) > return -EPERM; > @@ -2427,7 +2459,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg) > if (IS_ERR(di_args)) > return PTR_ERR(di_args); > > - if (memcmp(empty_uuid, di_args->uuid, BTRFS_UUID_SIZE) != 0) > + if (!btrfs_is_empty_uuid(di_args->uuid)) > s_uuid = di_args->uuid; > > mutex_lock(&fs_devices->device_list_mutex); > @@ -3952,6 +3984,7 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, > struct btrfs_trans_handle *trans; > struct timespec ct = CURRENT_TIME; > int ret = 0; > + int received_uuid_changed; > > ret = mnt_want_write_file(file); > if (ret < 0) > @@ -3981,7 +4014,11 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, > goto out; > } > > - trans = btrfs_start_transaction(root, 1); > + /* > + * 1 - root item > + * 2 - uuid items (received uuid + subvol uuid) > + */ > + trans = btrfs_start_transaction(root, 3); > if (IS_ERR(trans)) { > ret = PTR_ERR(trans); > trans = NULL; > @@ -3992,6 +4029,14 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, > sa->rtime.sec = ct.tv_sec; > sa->rtime.nsec = ct.tv_nsec; > > + received_uuid_changed = memcmp(root_item->received_uuid, sa->uuid, > + BTRFS_UUID_SIZE); > + if (received_uuid_changed && > + !btrfs_is_empty_uuid(root_item->received_uuid)) > + btrfs_del_uuid_received_subvol_item(trans, > + root->fs_info->uuid_root, > + root_item->received_uuid, > + root->root_key.objectid); > memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE); > btrfs_set_root_stransid(root_item, sa->stransid); > btrfs_set_root_rtransid(root_item, sa->rtransid); > @@ -4004,12 +4049,21 @@ static long btrfs_ioctl_set_received_subvol(struct file *file, > &root->root_key, &root->root_item); > if (ret < 0) { > btrfs_end_transaction(trans, root); > - trans = NULL; > goto out; > - } else { > - ret = btrfs_commit_transaction(trans, root); > - if (ret < 0) > + } > + if (received_uuid_changed && !btrfs_is_empty_uuid(sa->uuid)) { > + ret = btrfs_insert_uuid_received_subvol_item( > + trans, root->fs_info->uuid_root, sa->uuid, > + root->root_key.objectid); > + if (ret < 0 && ret != -EEXIST) { > + btrfs_abort_transaction(trans, root, ret); > goto out; > + } > + } > + ret = btrfs_commit_transaction(trans, root); > + if (ret < 0) { > + btrfs_abort_transaction(trans, root, ret); > + goto out; > } > > ret = copy_to_user(arg, sa, sizeof(*sa)); > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 0544587..a3d92b3 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1263,8 +1263,25 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, > dentry->d_name.len * 2); > parent_inode->i_mtime = parent_inode->i_ctime = CURRENT_TIME; > ret = btrfs_update_inode_fallback(trans, parent_root, parent_inode); > - if (ret) > + if (ret) { > + btrfs_abort_transaction(trans, root, ret); > + goto fail; > + } > + ret = btrfs_insert_uuid_subvol_item(trans, fs_info->uuid_root, > + new_uuid.b, objectid); > + if (ret) { > btrfs_abort_transaction(trans, root, ret); > + goto fail; > + } > + if (!btrfs_is_empty_uuid(new_root_item->received_uuid)) { > + ret = btrfs_insert_uuid_received_subvol_item( > + trans, fs_info->uuid_root, new_root_item->received_uuid, > + objectid); > + if (ret && ret != -EEXIST) { > + btrfs_abort_transaction(trans, root, ret); > + goto fail; > + } > + } > fail: > pending->error = ret; > dir_item_existed: > -- > 1.8.2.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 14, 2013 at 11:36:55AM +0200, Stefan Behrens wrote:> This tree is not created by mkfs.btrfs. Therefore when a filesystem > is mounted writable and the UUID tree does not exist, this tree is > created if required. The tree is also added to the fs_info structure > and initialized, but this commit does not yet read or write UUID tree > elements. > > Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> > --- > fs/btrfs/ctree.h | 1 + > fs/btrfs/disk-io.c | 40 +++++++++++++++++++++++++++++++++++++++- > fs/btrfs/volumes.c | 27 +++++++++++++++++++++++++++ > fs/btrfs/volumes.h | 1 + > 4 files changed, 68 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index ca9149a..6f9d760 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1305,6 +1305,7 @@ struct btrfs_fs_info { > struct btrfs_root *fs_root; > struct btrfs_root *csum_root; > struct btrfs_root *quota_root; > + struct btrfs_root *uuid_root; > > /* the log root tree is a directory of all the other log roots */ > struct btrfs_root *log_root_tree; > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 4904d7f..3c2bf74 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -1499,6 +1499,9 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info, > if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID) > return fs_info->quota_root ? fs_info->quota_root : > ERR_PTR(-ENOENT); > + if (location->objectid == BTRFS_UUID_TREE_OBJECTID) > + return fs_info->uuid_root ? fs_info->uuid_root : > + ERR_PTR(-ENOENT); > again: > spin_lock(&fs_info->fs_roots_radix_lock); > root = radix_tree_lookup(&fs_info->fs_roots_radix, > @@ -1955,6 +1958,10 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) > free_extent_buffer(info->quota_root->node); > free_extent_buffer(info->quota_root->commit_root); > } > + if (info->uuid_root) { > + free_extent_buffer(info->uuid_root->node); > + free_extent_buffer(info->uuid_root->commit_root); > + } > > info->tree_root->node = NULL; > info->tree_root->commit_root = NULL; > @@ -1968,6 +1975,10 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) > info->quota_root->node = NULL; > info->quota_root->commit_root = NULL; > } > + if (info->uuid_root) { > + info->uuid_root->node = NULL; > + info->uuid_root->commit_root = NULL; > + } > > if (chunk_root) { > free_extent_buffer(info->chunk_root->node); > @@ -2029,11 +2040,13 @@ int open_ctree(struct super_block *sb, > struct btrfs_root *chunk_root; > struct btrfs_root *dev_root; > struct btrfs_root *quota_root; > + struct btrfs_root *uuid_root; > struct btrfs_root *log_tree_root; > int ret; > int err = -EINVAL; > int num_backups_tried = 0; > int backup_index = 0; > + bool create_uuid_tree = false; > > tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info); > extent_root = fs_info->extent_root = btrfs_alloc_root(fs_info); > @@ -2041,9 +2054,10 @@ int open_ctree(struct super_block *sb, > chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info); > dev_root = fs_info->dev_root = btrfs_alloc_root(fs_info); > quota_root = fs_info->quota_root = btrfs_alloc_root(fs_info); > + uuid_root = fs_info->uuid_root = btrfs_alloc_root(fs_info); > > if (!tree_root || !extent_root || !csum_root || > - !chunk_root || !dev_root || !quota_root) { > + !chunk_root || !dev_root || !quota_root || !uuid_root) { > err = -ENOMEM; > goto fail; > } > @@ -2604,6 +2618,19 @@ retry_root_backup: > fs_info->pending_quota_state = 1; > } > > + ret = find_and_setup_root(tree_root, fs_info, > + BTRFS_UUID_TREE_OBJECTID, uuid_root); > + if (ret) { > + if (ret != -ENOENT) > + goto recovery_tree_root; > + > + kfree(uuid_root); > + uuid_root = fs_info->uuid_root = NULL; > + create_uuid_tree = true; > + } else { > + uuid_root->track_dirty = 1; > + } > + > fs_info->generation = generation; > fs_info->last_trans_committed = generation; > > @@ -2788,6 +2815,17 @@ retry_root_backup: > return ret; > } > > + if (create_uuid_tree) { > + pr_info("btrfs: creating UUID tree\n"); > + ret = btrfs_create_uuid_tree(fs_info); > + if (ret) { > + pr_warn("btrfs: failed to create the UUID tree %d\n", > + ret); > + close_ctree(tree_root); > + return ret; > + } > + } > + > return 0; > > fail_qgroup: > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a191bac..e6f5b88 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -3427,6 +3427,33 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) > return 0; > } > > +int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info) > +{ > + struct btrfs_trans_handle *trans; > + struct btrfs_root *tree_root = fs_info->tree_root; > + struct btrfs_root *uuid_root; > + > + /* > + * 1 - root node > + * 1 - root item > + */ > + trans = btrfs_start_transaction(tree_root, 2); > + if (IS_ERR(trans)) > + return PTR_ERR(trans); > + > + uuid_root = btrfs_create_tree(trans, fs_info, > + BTRFS_UUID_TREE_OBJECTID); > + if (IS_ERR(uuid_root)) { > + btrfs_abort_transaction(trans, tree_root, > + PTR_ERR(uuid_root)); > + return PTR_ERR(uuid_root); > + } > + > + fs_info->uuid_root = uuid_root; > + uuid_root->track_dirty = 1;btrfs_create_tree() has already set track_dirty. thanks, liubo> + > + return btrfs_commit_transaction(trans, tree_root); > +} > /* > * shrinking a device means finding all of the device extents past > * the new size, and then following the back refs to the chunks. > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 845ccbb..3479532 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -295,6 +295,7 @@ int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info); > int btrfs_recover_balance(struct btrfs_fs_info *fs_info); > int btrfs_pause_balance(struct btrfs_fs_info *fs_info); > int btrfs_cancel_balance(struct btrfs_fs_info *fs_info); > +int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info); > int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); > int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes, > u64 *start, u64 *max_avail); > -- > 1.8.2.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2013-May-14 10:55 UTC
Re: [PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
On Tue, May 14, 2013 at 11:36:52AM +0200, Stefan Behrens wrote:> Mapping UUIDs to subvolume IDs is an operation with a high effort > today. Today, the algorithm even has quadratic effort (based on the > number of existing subvolumes), which means, that it takes minutes > to send/receive a single subvolume if 10,000 subvolumes exist. But > even linear effort would be too much since it is a waste. And these > data structures to allow mapping UUIDs to subvolume IDs are created > every time a btrfs send/receive instance is started. > > So the issue to address is that Btrfs send / receive does not work > as it is today when a high number of subvolumes exist. > > It is much more efficient to maintain a searchable persistent data > structure in the filesystem, one that is updated whenever a > subvolume/snapshot is created and deleted, and when the received > subvolume UUID is set by the btrfs-receive tool. > > Therefore kernel code is added that is able to maintain data > structures in the filesystem that allow to quickly search for a > given UUID and to retrieve the subvol ID. > > Now follows the lengthy justification, why a new tree was added > instead of using the existing root tree: > > The first approach was to not create another tree that holds UUID > items. Instead, the items should just go into the top root tree. > Unfortunately this confused the algorithm to assign the objectid > of subvolumes and snapshots. The reason is that > btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for > the first created subvol or snapshot after mounting a filesystem, > and this function simply searches for the largest used objectid in > the root tree keys to pick the next objectid to assign. Of course, > the UUID keys have always been the ones with the highest offset > value, and the next assigned subvol ID was wastefully huge. > > To use any other existing tree did not look proper. To apply a > workaround such as setting the objectid to zero in the UUID item > key and to implement collision handling would either add > limitations (in case of a btrfs_extend_item() approach to handle > the collisions) or a lot of complexity and source code (in case a > key would be looked up that is free of collisions). Adding new code > that introduces limitations is not good, and adding code that is > complex and lengthy for no good reason is also not good. That''s the > justification why a completely new tree was introduced.I''d appreciate if some performance number appear here since it''s a speedup. thanks, liubo> > v1 -> v2: > - All review comments from David Sterba, Josef Bacik and Jan Schmidt > are addressed. > The hugest change was to add a mechanism that handles the case that > the filesystem is mounted with an older kernel. Now that case is > detected when the filesystem is mounted with a newer kernel again, > and the UUID tree is updated in the background. > > Stefan Behrens (8): > Btrfs: introduce a tree for items that map UUIDs to something > Btrfs: support printing UUID tree elements > Btrfs: create UUID tree if required > Btrfs: maintain subvolume items in the UUID tree > Btrfs: fill UUID tree initially > Btrfs: introduce uuid-tree-gen field > Btrfs: check UUID tree during mount if required > Btrfs: add mount option to force UUID tree checking > > fs/btrfs/Makefile | 3 +- > fs/btrfs/ctree.h | 65 ++++- > fs/btrfs/disk-io.c | 62 ++++- > fs/btrfs/ioctl.c | 74 +++++- > fs/btrfs/print-tree.c | 82 +++++++ > fs/btrfs/super.c | 8 +- > fs/btrfs/transaction.c | 21 +- > fs/btrfs/uuid-tree.c | 637 +++++++++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/volumes.c | 258 ++++++++++++++++++++ > fs/btrfs/volumes.h | 2 + > 10 files changed, 1197 insertions(+), 15 deletions(-) > create mode 100644 fs/btrfs/uuid-tree.c > > -- > 1.8.2.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-14 17:11 UTC
Re: [PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
On Tue, 14 May 2013 18:55:23 +0800, Liu Bo wrote:> On Tue, May 14, 2013 at 11:36:52AM +0200, Stefan Behrens wrote: >> Mapping UUIDs to subvolume IDs is an operation with a high effort >> today. Today, the algorithm even has quadratic effort (based on the >> number of existing subvolumes), which means, that it takes minutes >> to send/receive a single subvolume if 10,000 subvolumes exist. But >> even linear effort would be too much since it is a waste. And these >> data structures to allow mapping UUIDs to subvolume IDs are created >> every time a btrfs send/receive instance is started. >> >> So the issue to address is that Btrfs send / receive does not work >> as it is today when a high number of subvolumes exist. >> >> It is much more efficient to maintain a searchable persistent data >> structure in the filesystem, one that is updated whenever a >> subvolume/snapshot is created and deleted, and when the received >> subvolume UUID is set by the btrfs-receive tool. >> >> Therefore kernel code is added that is able to maintain data >> structures in the filesystem that allow to quickly search for a >> given UUID and to retrieve the subvol ID. >> >> Now follows the lengthy justification, why a new tree was added >> instead of using the existing root tree: >> >> The first approach was to not create another tree that holds UUID >> items. Instead, the items should just go into the top root tree. >> Unfortunately this confused the algorithm to assign the objectid >> of subvolumes and snapshots. The reason is that >> btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for >> the first created subvol or snapshot after mounting a filesystem, >> and this function simply searches for the largest used objectid in >> the root tree keys to pick the next objectid to assign. Of course, >> the UUID keys have always been the ones with the highest offset >> value, and the next assigned subvol ID was wastefully huge. >> >> To use any other existing tree did not look proper. To apply a >> workaround such as setting the objectid to zero in the UUID item >> key and to implement collision handling would either add >> limitations (in case of a btrfs_extend_item() approach to handle >> the collisions) or a lot of complexity and source code (in case a >> key would be looked up that is free of collisions). Adding new code >> that introduces limitations is not good, and adding code that is >> complex and lengthy for no good reason is also not good. That''s the >> justification why a completely new tree was introduced. > > I''d appreciate if some performance number appear here since it''s a speedup.That''s a good idea. The numbers are below in the table and there''s also a link to a chart. I stopped the measurement with the old version after 10000 subvolumes because it already took almost 13 minutes to send a single, empty subvolume. All the time is spent building a database, and this is done each time the btrfs send or receive tool is started to send or receive a single subvolume. The table shows the time it takes to send a single, empty subvolume depending on the number of subvolume that exist in the filesystem. # of subvols | without | with in filesystem | UUID tree | UUID tree --------------+------------+---------- 2 | 0m00.004s | 0m00.003s 1000 | 0m07.010s | 0m00.004s 2000 | 0m28.210s | 0m00.004s 3000 | 1m04.872s | 0m00.004s 4000 | 1m56.059s | 0m00.004s 5000 | 3m00.489s | 0m00.004s 6000 | 4m27.376s | 0m00.004s 7000 | 6m08.938s | 0m00.004s 8000 | 7m54.020s | 0m00.004s 9000 | 10m05.108s | 0m00.004s 10000 | 12m47.406s | 0m00.004s Or as a chart: http://btrfs.giantdisaster.de/Btrfs-send-recv-perf.pdf The Hardware: Intel(R) Xeon(R) CPU X3450 @ 2.67GHz 8 GB RAM 6 high performance SSDs The script: #!/bin/bash set -e MOUNT=/mnt2 umount $MOUNT || true mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv mount /dev/sdc $MOUNT btrfs subv create $MOUNT/0 btrfs subv snapshot -r $MOUNT/0 $MOUNT/0ro > /dev/null echo ''2 subvols'' time btrfs send $MOUNT/0ro > /dev/null umount $MOUNT mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv mount /dev/sdc $MOUNT for i in `seq 1000`; do btrfs subv create $MOUNT/$i for j in `seq 500`; do btrfs subv create $MOUNT/$i/$j > /dev/null btrfs subv snapshot -r $MOUNT/$i/$j $MOUNT/$i/${j}ro > /dev/null done echo $i ''k subvols'' time btrfs send $MOUNT/$i/1ro > /dev/null done -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-15 08:52 UTC
Re: [PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
On Tue, 14 May 2013 19:11:54 +0200, Stefan Behrens wrote:> On Tue, 14 May 2013 18:55:23 +0800, Liu Bo wrote: >> On Tue, May 14, 2013 at 11:36:52AM +0200, Stefan Behrens wrote: >>> Mapping UUIDs to subvolume IDs is an operation with a high effort >>> today. Today, the algorithm even has quadratic effort (based on the >>> number of existing subvolumes), which means, that it takes minutes >>> to send/receive a single subvolume if 10,000 subvolumes exist. But >>> even linear effort would be too much since it is a waste. And these >>> data structures to allow mapping UUIDs to subvolume IDs are created >>> every time a btrfs send/receive instance is started. >>> >>> So the issue to address is that Btrfs send / receive does not work >>> as it is today when a high number of subvolumes exist. >>> >>> It is much more efficient to maintain a searchable persistent data >>> structure in the filesystem, one that is updated whenever a >>> subvolume/snapshot is created and deleted, and when the received >>> subvolume UUID is set by the btrfs-receive tool. >>> >>> Therefore kernel code is added that is able to maintain data >>> structures in the filesystem that allow to quickly search for a >>> given UUID and to retrieve the subvol ID. >>> >>> Now follows the lengthy justification, why a new tree was added >>> instead of using the existing root tree: >>> >>> The first approach was to not create another tree that holds UUID >>> items. Instead, the items should just go into the top root tree. >>> Unfortunately this confused the algorithm to assign the objectid >>> of subvolumes and snapshots. The reason is that >>> btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for >>> the first created subvol or snapshot after mounting a filesystem, >>> and this function simply searches for the largest used objectid in >>> the root tree keys to pick the next objectid to assign. Of course, >>> the UUID keys have always been the ones with the highest offset >>> value, and the next assigned subvol ID was wastefully huge. >>> >>> To use any other existing tree did not look proper. To apply a >>> workaround such as setting the objectid to zero in the UUID item >>> key and to implement collision handling would either add >>> limitations (in case of a btrfs_extend_item() approach to handle >>> the collisions) or a lot of complexity and source code (in case a >>> key would be looked up that is free of collisions). Adding new code >>> that introduces limitations is not good, and adding code that is >>> complex and lengthy for no good reason is also not good. That''s the >>> justification why a completely new tree was introduced. >> >> I''d appreciate if some performance number appear here since it''s a speedup. > > That''s a good idea. The numbers are below in the table and there''s also a link to a chart. > > I stopped the measurement with the old version after 10000 subvolumes because it already took almost 13 minutes to send a single, empty subvolume. All the time is spent building a database, and this is done each time the btrfs send or receive tool is started to send or receive a single subvolume. > > The table shows the time it takes to send a single, empty subvolume depending on the number of subvolume that exist in the filesystem. > > # of subvols | without | with > in filesystem | UUID tree | UUID tree > --------------+------------+---------- > 2 | 0m00.004s | 0m00.003s > 1000 | 0m07.010s | 0m00.004s > 2000 | 0m28.210s | 0m00.004s > 3000 | 1m04.872s | 0m00.004s > 4000 | 1m56.059s | 0m00.004s > 5000 | 3m00.489s | 0m00.004s > 6000 | 4m27.376s | 0m00.004s > 7000 | 6m08.938s | 0m00.004s > 8000 | 7m54.020s | 0m00.004s > 9000 | 10m05.108s | 0m00.004s > 10000 | 12m47.406s | 0m00.004s > > Or as a chart: > http://btrfs.giantdisaster.de/Btrfs-send-recv-perf.pdfThe table goes on like this for larger number of subvolumes (and the time value is always the time to transfer just _one_ of the subvolumes): # of subvols | without | with in filesystem | UUID tree | UUID tree --------------+------------+---------- 2 | 0m00.004s | 0m00.003s 1000 | 0m07.010s | 0m00.004s 2000 | 0m28.210s | 0m00.004s 3000 | 1m04.872s | 0m00.004s 4000 | 1m56.059s | 0m00.004s 5000 | 3m00.489s | 0m00.004s 6000 | 4m27.376s | 0m00.004s 7000 | 6m08.938s | 0m00.004s 8000 | 7m54.020s | 0m00.004s 9000 | 10m05.108s | 0m00.004s 10000 | 12m47.406s | 0m00.004s 11000 | 15m05.800s | 0m00.004s 12000 | 18m00.170s | 0m00.004s 13000 | 21m39.438s | 0m00.004s 14000 | 24m54.681s | 0m00.004s 15000 | 28m09.096s | 0m00.004s 16000 | 33m08.856s | 0m00.004s 17000 | 37m10.562s | 0m00.004s 18000 | 41m44.727s | 0m00.004s 19000 | 46m14.335s | 0m00.004s 20000 | 51m55.100s | 0m00.004s 21000 | 56m54.346s | 0m00.004s 22000 | 62m53.466s | 0m00.004s 23000 | 66m57.328s | 0m00.004s 24000 | 73m59.687s | 0m00.004s 25000 | 81m24.476s | 0m00.004s 26000 | 87m11.478s | 0m00.004s 27000 | 92m59.225s | 0m00.004s For 100,000 existing subvolumes, the calculated value is 22 hours 25 minutes 19 seconds to start btrfs send/receive to transfer a single subvolume. For 1,000,000 existing subvolumes, it would take 102 days. 30 years for 10 million existing subvolumes. And 30 years is unacceptable.> The Hardware: > Intel(R) Xeon(R) CPU X3450 @ 2.67GHz > 8 GB RAM > 6 high performance SSDs > > The script: > #!/bin/bash > set -e > MOUNT=/mnt2 > umount $MOUNT || true > mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv > mount /dev/sdc $MOUNT > btrfs subv create $MOUNT/0 > btrfs subv snapshot -r $MOUNT/0 $MOUNT/0ro > /dev/null > echo ''2 subvols'' > time btrfs send $MOUNT/0ro > /dev/null > umount $MOUNT > mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv > mount /dev/sdc $MOUNT > for i in `seq 1000`; do > btrfs subv create $MOUNT/$i > for j in `seq 500`; do > btrfs subv create $MOUNT/$i/$j > /dev/null > btrfs subv snapshot -r $MOUNT/$i/$j $MOUNT/$i/${j}ro > /dev/null > done > echo $i ''k subvols'' > time btrfs send $MOUNT/$i/1ro > /dev/null > done-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-May-15 13:12 UTC
Re: [PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
On Tue, May 14, 2013 at 07:11:54PM +0200, Stefan Behrens wrote:> # of subvols | without | with > in filesystem | UUID tree | UUID tree > --------------+------------+---------- > 2 | 0m00.004s | 0m00.003s > 1000 | 0m07.010s | 0m00.004s > 2000 | 0m28.210s | 0m00.004s > 3000 | 1m04.872s | 0m00.004s > 4000 | 1m56.059s | 0m00.004s > 5000 | 3m00.489s | 0m00.004s > 6000 | 4m27.376s | 0m00.004s > 7000 | 6m08.938s | 0m00.004s > 8000 | 7m54.020s | 0m00.004s > 9000 | 10m05.108s | 0m00.004s > 10000 | 12m47.406s | 0m00.004sIt looks like a copy paste error in the 3rd column. No really, this is a great improvement! david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stefan Behrens
2013-May-15 15:39 UTC
Re: [PATCH v2 4/8] Btrfs: maintain subvolume items in the UUID tree
On Tue, 14 May 2013 18:44:11 +0800, Liu Bo wrote:> On Tue, May 14, 2013 at 11:36:56AM +0200, Stefan Behrens wrote: >> @@ -396,7 +403,7 @@ static noinline int create_subvol(struct inode *dir, >> * of create_snapshot(). >> */ >> ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, >> - 7, &qgroup_reserved); >> + 8, &qgroup_reserved); >> if (ret) >> return ret; > > This uuid_root will not use trans->block_rsv but empty_rsv since it does not set > ref_cow, so you don''t need to add one more to block_rsv, and same for > the below cases.Hi Liu Bo, Thanks for your review comments! You are right, this won''t work because the empty_block_rsv is used for the UUID tree as it is now. I need to avoid ENOSPC in the middle of the transaction. Can you please acknowledge or comment the following addition to get_block_rsv()? And the "plus 1" that I have added to the transaction reservations for subvolume/snapshot creation/destruction will work correctly afterwards. --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4259,6 +4259,9 @@ static struct btrfs_block_rsv *get_block_rsv( if (root == root->fs_info->csum_root && trans->adding_csums) block_rsv = trans->block_rsv; + if (root == root->fs_info->uuid_root) + block_rsv = trans->block_rsv; + if (!block_rsv) block_rsv = root->block_rsv;>> @@ -567,9 +578,10 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, >> * 1 - root item >> * 2 - root ref/backref >> * 1 - root of snapshot >> + * 1 - UUID item >> */ >> ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root, >> - &pending_snapshot->block_rsv, 7, >> + &pending_snapshot->block_rsv, 8, >> &pending_snapshot->qgroup_reserved); >> if (ret) >> goto out;-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2013-May-16 01:50 UTC
Re: [PATCH v2 4/8] Btrfs: maintain subvolume items in the UUID tree
On Wed, May 15, 2013 at 05:39:35PM +0200, Stefan Behrens wrote:> On Tue, 14 May 2013 18:44:11 +0800, Liu Bo wrote: > > On Tue, May 14, 2013 at 11:36:56AM +0200, Stefan Behrens wrote: > >> @@ -396,7 +403,7 @@ static noinline int create_subvol(struct inode *dir, > >> * of create_snapshot(). > >> */ > >> ret = btrfs_subvolume_reserve_metadata(root, &block_rsv, > >> - 7, &qgroup_reserved); > >> + 8, &qgroup_reserved); > >> if (ret) > >> return ret; > > > > This uuid_root will not use trans->block_rsv but empty_rsv since it does not set > > ref_cow, so you don''t need to add one more to block_rsv, and same for > > the below cases. > > Hi Liu Bo, > > Thanks for your review comments! > > You are right, this won''t work because the empty_block_rsv is used for > the UUID tree as it is now. > > I need to avoid ENOSPC in the middle of the transaction. Can you please > acknowledge or comment the following addition to get_block_rsv()? And > the "plus 1" that I have added to the transaction reservations for > subvolume/snapshot creation/destruction will work correctly afterwards.Looks good. (Although I''m more looking forward to Josef''s metadata enospc killer patch, that way we won''t ever need to deal with all of this buggy stuff ;)) thanks, liubo> > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4259,6 +4259,9 @@ static struct btrfs_block_rsv *get_block_rsv( > if (root == root->fs_info->csum_root && trans->adding_csums) > block_rsv = trans->block_rsv; > > + if (root == root->fs_info->uuid_root) > + block_rsv = trans->block_rsv; > + > if (!block_rsv) > block_rsv = root->block_rsv; > > > > >> @@ -567,9 +578,10 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, > >> * 1 - root item > >> * 2 - root ref/backref > >> * 1 - root of snapshot > >> + * 1 - UUID item > >> */ > >> ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root, > >> - &pending_snapshot->block_rsv, 7, > >> + &pending_snapshot->block_rsv, 8, > >> &pending_snapshot->qgroup_reserved); > >> if (ret) > >> goto out; >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2013-May-16 06:35 UTC
Re: [PATCH v2 1/8] Btrfs: introduce a tree for items that map UUIDs to something
On Tue, May 14, 2013 at 11:36:53AM +0200, Stefan Behrens wrote:> Mapping UUIDs to subvolume IDs is an operation with a high effort > today. Today, the algorithm even has quadratic effort (based on the > number of existing subvolumes), which means, that it takes minutes > to send/receive a single subvolume if 10,000 subvolumes exist. But > even linear effort would be too much since it is a waste. And these > data structures to allow mapping UUIDs to subvolume IDs are created > every time a btrfs send/receive instance is started. > > It is much more efficient to maintain a searchable persistent data > structure in the filesystem, one that is updated whenever a > subvolume/snapshot is created and deleted, and when the received > subvolume UUID is set by the btrfs-receive tool. > > Therefore kernel code is added with this commit that is able to > maintain data structures in the filesystem that allow to quickly > search for a given UUID and to retrieve data that is assigned to > this UUID, like which subvolume ID is related to this UUID. > > This commit adds a new tree to hold UUID-to-data mapping items. The > key of the items is the full UUID plus the key type BTRFS_UUID_KEY. > Multiple data blocks can be stored for a given UUID, a type/length/ > value scheme is used. > > Now follows the lengthy justification, why a new tree was added > instead of using the existing root tree: > > The first approach was to not create another tree that holds UUID > items. Instead, the items should just go into the top root tree. > Unfortunately this confused the algorithm to assign the objectid > of subvolumes and snapshots. The reason is that > btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for > the first created subvol or snapshot after mounting a filesystem, > and this function simply searches for the largest used objectid in > the root tree keys to pick the next objectid to assign. Of course, > the UUID keys have always been the ones with the highest offset > value, and the next assigned subvol ID was wastefully huge. > > To use any other existing tree did not look proper. To apply a > workaround such as setting the objectid to zero in the UUID item > key and to implement collision handling would either add > limitations (in case of a btrfs_extend_item() approach to handle > the collisions) or a lot of complexity and source code (in case a > key would be looked up that is free of collisions). Adding new code > that introduces limitations is not good, and adding code that is > complex and lengthy for no good reason is also not good. That''s the > justification why a completely new tree was introduced. > > Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> > --- > fs/btrfs/Makefile | 3 +- > fs/btrfs/ctree.h | 50 ++++++ > fs/btrfs/uuid-tree.c | 481 +++++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 533 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > index 3932224..a550dfc 100644 > --- a/fs/btrfs/Makefile > +++ b/fs/btrfs/Makefile > @@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ > export.o tree-log.o free-space-cache.o zlib.o lzo.o \ > compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > - reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o > + reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > + uuid-tree.o > > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index f415377..ca9149a 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -91,6 +91,9 @@ struct btrfs_ordered_sum; > /* holds quota configuration and tracking */ > #define BTRFS_QUOTA_TREE_OBJECTID 8ULL > > +/* for storing items that use the BTRFS_UUID_KEY */ > +#define BTRFS_UUID_TREE_OBJECTID 9ULL > + > /* for storing balance parameters in the root tree */ > #define BTRFS_BALANCE_OBJECTID -4ULL > > @@ -953,6 +956,18 @@ struct btrfs_dev_replace_item { > __le64 num_uncorrectable_read_errors; > } __attribute__ ((__packed__)); > > +/* for items that use the BTRFS_UUID_KEY */ > +#define BTRFS_UUID_ITEM_TYPE_SUBVOL 0 /* for UUIDs assigned to subvols */ > +#define BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL 1 /* for UUIDs assigned to > + * received subvols */ > + > +/* a sequence of such items is stored under the BTRFS_UUID_KEY */ > +struct btrfs_uuid_item { > + __le64 type; /* refer to BTRFS_UUID_ITEM_TYPE* defines above */ > + __le64 len; /* number of following 64bit values */Can we put it smaller, such as u8 or __le16? Will type and len be that long as 2^64? thanks, liubo> + __le64 subid[0]; /* sequence of subids */ > +} __attribute__ ((__packed__)); > + > /* different types of block groups (and chunks) */ > #define BTRFS_BLOCK_GROUP_DATA (1ULL << 0) > #define BTRFS_BLOCK_GROUP_SYSTEM (1ULL << 1) > @@ -1901,6 +1916,17 @@ struct btrfs_ioctl_defrag_range_args { > #define BTRFS_DEV_REPLACE_KEY 250 > > /* > + * Stores items that allow to quickly map UUIDs to something else. > + * These items are part of the filesystem UUID tree. > + * The key is built like this: > + * (UUID_upper_64_bits, BTRFS_UUID_KEY, UUID_lower_64_bits). > + */ > +#if BTRFS_UUID_SIZE != 16 > +#error "UUID items require BTRFS_UUID_SIZE == 16!" > +#endif > +#define BTRFS_UUID_KEY 251 > + > +/* > * string items are for debugging. They just store a short string of > * data in the FS > */ > @@ -2956,6 +2982,12 @@ BTRFS_SETGET_FUNCS(dev_replace_cursor_left, struct btrfs_dev_replace_item, > BTRFS_SETGET_FUNCS(dev_replace_cursor_right, struct btrfs_dev_replace_item, > cursor_right, 64); > > +/* btrfs_uuid_item */ > +BTRFS_SETGET_FUNCS(uuid_type, struct btrfs_uuid_item, type, 64); > +BTRFS_SETGET_FUNCS(uuid_len, struct btrfs_uuid_item, len, 64); > +BTRFS_SETGET_STACK_FUNCS(stack_uuid_type, struct btrfs_uuid_item, type, 64); > +BTRFS_SETGET_STACK_FUNCS(stack_uuid_len, struct btrfs_uuid_item, len, 64); > + > BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_src_devid, > struct btrfs_dev_replace_item, src_devid, 64); > BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cont_reading_from_srcdev_mode, > @@ -3372,6 +3404,24 @@ void btrfs_check_and_init_root_item(struct btrfs_root_item *item); > void btrfs_update_root_times(struct btrfs_trans_handle *trans, > struct btrfs_root *root); > > +/* uuid-tree.c */ > +int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, > + u64 *subvol_id); > +int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id); > +int btrfs_del_uuid_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id); > +int btrfs_lookup_uuid_received_subvol_item(struct btrfs_root *uuid_root, > + u8 *uuid, u64 *subvol_id); > +int btrfs_insert_uuid_received_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, > + u8 *uuid, u64 subvol_id); > +int btrfs_del_uuid_received_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id); > + > /* dir-item.c */ > int btrfs_check_dir_item_collision(struct btrfs_root *root, u64 dir, > const char *name, int name_len); > diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c > new file mode 100644 > index 0000000..a44e13f > --- /dev/null > +++ b/fs/btrfs/uuid-tree.c > @@ -0,0 +1,481 @@ > +/* > + * Copyright (C) STRATO AG 2013. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public > + * License v2 as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; if not, write to the > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, > + * Boston, MA 021110-1307, USA. > + */ > +#include <linux/uuid.h> > +#include <asm/unaligned.h> > +#include "ctree.h" > +#include "transaction.h" > +#include "disk-io.h" > +#include "print-tree.h" > + > + > +/* > + * One key is used to store a sequence of btrfs_uuid_item items. > + * Each item in the sequence contains a type information and a sequence of > + * ids (together with the information about the size of the sequence of ids). > + * {[btrfs_uuid_item type0 {id0, id1, ..., idN}], > + * ..., > + * [btrfs_uuid_item typeZ {id0, id1, ..., idN}]} > + * > + * It is forbidden to put multiple items with the same type under the same key. > + * Instead the sequence of ids is extended and used to store any additional > + * ids for the same item type. > + */ > + > +static void btrfs_uuid_to_key(u8 *uuid, struct btrfs_key *key) > +{ > + key->type = BTRFS_UUID_KEY; > + key->objectid = get_unaligned_le64(uuid); > + key->offset = get_unaligned_le64(uuid + sizeof(u64)); > +} > + > +static struct btrfs_uuid_item *btrfs_match_uuid_item_type( > + struct btrfs_path *path, u64 type) > +{ > + struct extent_buffer *eb; > + int slot; > + struct btrfs_uuid_item *ptr; > + u32 item_size; > + > + eb = path->nodes[0]; > + slot = path->slots[0]; > + ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item); > + item_size = btrfs_item_size_nr(eb, slot); > + do { > + u64 sub_item_type; > + u64 sub_item_len; > + > + if (item_size < sizeof(*ptr)) { > + pr_warn("btrfs: uuid item too short (%llu < %d)!\n", > + (unsigned long long)item_size, > + (int)sizeof(*ptr)); > + return NULL; > + } > + item_size -= sizeof(*ptr); > + sub_item_type = btrfs_uuid_type(eb, ptr); > + sub_item_len = btrfs_uuid_len(eb, ptr); > + if (sub_item_len * sizeof(u64) > item_size) { > + pr_warn("btrfs: uuid item too short (%llu > %llu)!\n", > + (unsigned long long)(sub_item_len * > + sizeof(u64)), > + (unsigned long long)item_size); > + return NULL; > + } > + if (sub_item_type == type) > + return ptr; > + item_size -= sub_item_len * sizeof(u64); > + ptr = 1 + (struct btrfs_uuid_item *) > + (((char *)ptr) + (sub_item_len * sizeof(u64))); > + } while (item_size); > + > + return NULL; > +} > + > +static int btrfs_uuid_tree_lookup_prepare(struct btrfs_root *uuid_root, > + u8 *uuid, u64 type, > + struct btrfs_path *path, > + struct btrfs_uuid_item **ptr) > +{ > + int ret; > + struct btrfs_key key; > + > + if (!uuid_root) { > + WARN_ON_ONCE(1); > + ret = -ENOENT; > + goto out; > + } > + > + btrfs_uuid_to_key(uuid, &key); > + > + ret = btrfs_search_slot(NULL, uuid_root, &key, path, 0, 0); > + if (ret < 0) > + goto out; > + if (ret > 0) { > + ret = -ENOENT; > + goto out; > + } > + > + *ptr = btrfs_match_uuid_item_type(path, type); > + if (!*ptr) { > + ret = -ENOENT; > + goto out; > + } > + > + ret = 0; > + > +out: > + return ret; > +} > + > +/* return -ENOENT for !found, < 0 for errors, or 0 if an item was found */ > +static int btrfs_uuid_tree_lookup(struct btrfs_root *uuid_root, u8 *uuid, > + u64 type, u64 subid) > +{ > + int ret; > + struct btrfs_path *path; > + struct extent_buffer *eb; > + struct btrfs_uuid_item *ptr; > + u64 sub_item_len; > + unsigned long offset; > + > + path = btrfs_alloc_path(); > + if (!path) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = btrfs_uuid_tree_lookup_prepare(uuid_root, uuid, type, path, &ptr); > + if (ret < 0) > + goto out; > + > + eb = path->nodes[0]; > + sub_item_len = btrfs_uuid_len(eb, ptr); > + WARN_ON_ONCE(sub_item_len == 0); > + ret = -ENOENT; > + ptr++; > + offset = (unsigned long)ptr; > + while (sub_item_len > 0) { > + u64 data; > + > + read_extent_buffer(eb, &data, offset, sizeof(data)); > + data = le64_to_cpu(data); > + if (data == subid) { > + ret = 0; > + break; > + } > + offset += sizeof(data); > + sub_item_len--; > + } > + > +out: > + btrfs_free_path(path); > + return ret; > +} > + > +/* return -ENOENT for !found, < 0 for errors, or 0 if an item was found */ > +static int btrfs_uuid_tree_lookup_any(struct btrfs_root *uuid_root, u8 *uuid, > + u64 type, u64 *subid) > +{ > + int ret; > + struct btrfs_path *path; > + struct extent_buffer *eb; > + struct btrfs_uuid_item *ptr; > + u64 sub_item_len; > + > + path = btrfs_alloc_path(); > + if (!path) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = btrfs_uuid_tree_lookup_prepare(uuid_root, uuid, type, path, &ptr); > + if (ret < 0) > + goto out; > + > + eb = path->nodes[0]; > + sub_item_len = btrfs_uuid_len(eb, ptr); > + WARN_ON_ONCE(sub_item_len == 0); > + if (sub_item_len > 0) { > + /* return first stored id */ > + read_extent_buffer(eb, subid, (unsigned long)(ptr + 1), > + sizeof(*subid)); > + *subid = le64_to_cpu(*subid); > + ret = 0; > + } else { > + ret = -ENOENT; > + } > + > +out: > + btrfs_free_path(path); > + return ret; > +} > + > +/* it is not checked whether the entry to add already exists */ > +static int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 type, u64 subid) > +{ > + int ret; > + struct btrfs_path *path = NULL; > + struct btrfs_key key; > + struct extent_buffer *eb; > + int slot; > + struct btrfs_uuid_item *ptr; > + u32 item_size; > + > + if (!uuid_root) { > + WARN_ON_ONCE(1); > + ret = -EINVAL; > + goto out; > + } > + > + btrfs_uuid_to_key(uuid, &key); > + > + path = btrfs_alloc_path(); > + if (!path) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = btrfs_insert_empty_item(trans, uuid_root, path, &key, > + sizeof(*ptr) + sizeof(subid)); > + if (ret == -EEXIST) { > + ptr = btrfs_match_uuid_item_type(path, type); > + if (ptr) { > + /* > + * An item with that type already exists. > + * Extend the item and store the subid at the > + * location of the first id of this type. > + */ > + unsigned long start; > + unsigned long move_dst; > + unsigned long move_src; > + unsigned long move_len; > + > + btrfs_extend_item(uuid_root, path, sizeof(subid)); > + ptr = (struct btrfs_uuid_item *) > + (((char *)ptr) - sizeof(subid)); > + eb = path->nodes[0]; > + slot = path->slots[0]; > + item_size = btrfs_item_size_nr(eb, slot); > + WARN_ON(btrfs_uuid_len(eb, ptr) == 0); > + btrfs_set_uuid_len(eb, ptr, > + btrfs_uuid_len(eb, ptr) + 1); > + ptr++; > + start = btrfs_item_ptr_offset(eb, slot); > + move_dst = ((unsigned long)ptr) + sizeof(subid); > + move_src = (unsigned long)ptr; > + move_len = item_size - (move_dst - start); > + memmove_extent_buffer(eb, move_dst, move_src, move_len); > + > + goto write_subid; > + } else { > + btrfs_extend_item(uuid_root, path, > + sizeof(*ptr) + sizeof(subid)); > + } > + } else if (ret < 0) { > + pr_warn("btrfs: insert uuid item failed %d (0x%016llx, 0x%016llx) type %llu!\n", > + ret, (unsigned long long)key.objectid, > + (unsigned long long)key.offset, > + (unsigned long long)type); > + goto out; > + } > + > + /* > + * Add an item for the type for the first time. Either add it behind > + * items with different types, or write the very first item. > + */ > + eb = path->nodes[0]; > + slot = path->slots[0]; > + ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item); > + item_size = btrfs_item_size_nr(eb, slot); > + ptr = (struct btrfs_uuid_item *) > + (((char *)ptr) + item_size - (sizeof(*ptr) + sizeof(subid))); > + btrfs_set_uuid_type(eb, ptr, type); > + btrfs_set_uuid_len(eb, ptr, 1); > + ptr++; > + > +write_subid: > + ret = 0; > + subid = cpu_to_le64(subid); > + write_extent_buffer(eb, &subid, (unsigned long)ptr, sizeof(subid)); > + btrfs_mark_buffer_dirty(eb); > + > +out: > + btrfs_free_path(path); > + return ret; > +} > + > +static int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 type, u64 subid) > +{ > + int ret; > + struct btrfs_path *path = NULL; > + struct btrfs_key key; > + struct extent_buffer *eb; > + int slot; > + struct btrfs_uuid_item *ptr_start; > + unsigned long offset; > + u32 item_size; > + u64 id_index; > + unsigned long start; > + unsigned long move_dst; > + unsigned long move_src; > + unsigned long move_len; > + > + if (!uuid_root) { > + WARN_ON_ONCE(1); > + ret = -EINVAL; > + goto out; > + } > + > + btrfs_uuid_to_key(uuid, &key); > + > + path = btrfs_alloc_path(); > + if (!path) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = btrfs_search_slot(trans, uuid_root, &key, path, -1, 1); > + if (ret < 0) { > + pr_warn("btrfs: error %d while searching for uuid item!\n", > + ret); > + goto out; > + } > + if (ret > 0) { > + ret = -ENOENT; > + goto out; > + } > + > + ptr_start = btrfs_match_uuid_item_type(path, type); > + if (!ptr_start) { > + /* case 1: no entry for that type is found */ > + ret = -ENOENT; > + goto out; > + } > + > + eb = path->nodes[0]; > + slot = path->slots[0]; > + start = btrfs_item_ptr_offset(eb, slot); > + item_size = btrfs_item_size_nr(eb, slot); > + id_index = btrfs_uuid_len(eb, ptr_start); > + offset = (unsigned long)(ptr_start + 1); > + while (id_index > 0) { > + u64 found_subid; > + > + read_extent_buffer(eb, &found_subid, offset, > + sizeof(found_subid)); > + found_subid = le64_to_cpu(found_subid); > + if (found_subid == subid) > + break; > + offset += sizeof(found_subid); > + id_index--; > + } > + > + if (!id_index) { > + /* > + * case 2: an entry for that type is found, but no match > + * for the id > + */ > + ret = -ENOENT; > + goto out; > + } > + > + if (btrfs_uuid_len(eb, ptr_start) == 1) { > + if (item_size == sizeof(*ptr_start) + sizeof(subid)) { > + /* > + * case 3: no other types and ids are stored, > + * delete the full item > + */ > + ret = btrfs_del_item(trans, uuid_root, path); > + goto out; > + } else { > + /* > + * case 4: No other ids for the particular type > + * are stored, but other types exist. Shrink the > + * item by the size of an id plus the size of the > + * item for that type. > + */ > + move_dst = (unsigned long)ptr_start; > + } > + } else { > + /* > + * case 5: Other ids for that particular type are stored. > + * Shrink the item by the size of an id. > + */ > + move_dst = offset; > + WARN_ON(btrfs_uuid_len(eb, ptr_start) == 0); > + btrfs_set_uuid_len(eb, ptr_start, > + btrfs_uuid_len(eb, ptr_start) - 1); > + } > + move_src = offset + sizeof(subid); > + move_len = item_size - (move_src - start); > + > + memmove_extent_buffer(eb, move_dst, move_src, move_len); > + btrfs_truncate_item(uuid_root, path, item_size - (move_src - move_dst), > + 1); > + > +out: > + btrfs_free_path(path); > + return ret; > +} > + > +int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid, > + u64 *subvol_id) > +{ > + return btrfs_uuid_tree_lookup_any(uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_SUBVOL, > + subvol_id); > +} > + > +int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id) > +{ > + int ret; > + > + ret = btrfs_uuid_tree_lookup(uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_SUBVOL, subvol_id); > + if (ret == -ENOENT) > + ret = btrfs_uuid_tree_add(trans, uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_SUBVOL, > + subvol_id); > + return ret; > +} > + > +int btrfs_del_uuid_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id) > +{ > + return btrfs_uuid_tree_rem(trans, uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_SUBVOL, subvol_id); > +} > + > +int btrfs_lookup_uuid_received_subvol_item(struct btrfs_root *uuid_root, > + u8 *uuid, u64 *subvol_id) > +{ > + return btrfs_uuid_tree_lookup_any(uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, > + subvol_id); > +} > + > +int btrfs_insert_uuid_received_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, > + u8 *uuid, u64 subvol_id) > +{ > + int ret; > + > + ret = btrfs_uuid_tree_lookup(uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, > + subvol_id); > + if (ret == -ENOENT) > + ret = btrfs_uuid_tree_add(trans, uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, > + subvol_id); > + return ret; > +} > + > +int btrfs_del_uuid_received_subvol_item(struct btrfs_trans_handle *trans, > + struct btrfs_root *uuid_root, u8 *uuid, > + u64 subvol_id) > +{ > + return btrfs_uuid_tree_rem(trans, uuid_root, uuid, > + BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL, > + subvol_id); > +} > -- > 1.8.2.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html