Currently btrfs has a limitation on the maximum number of hard links an inode can have. Specifically, links are stored in an array of ref items: struct btrfs_inode_ref { __le64 index; __le16 name_len; /* name goes here */ } __attribute__ ((__packed__)); The ref arrays are found via key triple: (inode objectid, BTRFS_INODE_EXTREF_KEY, parent dir objectid) Since items can not exceed the size of a leaf, the total number of links that can be stored for a given inode / parent dir pair is limited to under 4k. This works fine for the most common case of few to only a handful of links. Once the link count gets higher however, we begin to return EMLINK. The following patches fix this situation by introducing a new ref item: struct btrfs_inode_extref { __le64 parent_objectid; __le64 index; __le16 name_len; __u8 name[0]; /* name goes here */ } __attribute__ ((__packed__)); Extended refs behave differently from ref arrays in several key areas. Each extended refs is it''s own item so there is no ref array (and therefore no limit on size). As a result, we must use a different addressing scheme. Extended ref keys look like: (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) Where hash is defined as a function of the parent objectid and link name. This effectively fixes the limitation, though we have a slightly less efficient packing of link data. To keep the best of both worlds then, I implemented the following behavior: Extended refs don''t replace the existing ref array. An inode gets an extended ref for a given link _only_ after the ref array has been filled. So the most common cases shouldn''t actually see any difference in performance or disk usage as they''ll never get to the point where we''re using an extended ref. It''s important while reading the patches however that there''s still the possibility that we can have a set of operations that grow out an inode ref array (adding some extended refs) and then remove only the refs in the array. I don''t really see this being common but it''s a case we always have to consider when coding these changes. Right now there is a limitation for extrefs in that we''re not handling the possibility of a hash collision. There are two ways I see we can deal with this: We can use a 56-bit hash and keep a generation counter in the lower 8 bits of the offset field. The cost would be an additional tree search (between offset <hash>00 and <hash>FF) if we don''t find exactly the name we were looking for. An alternative solution to dealing with collisions could be to emulate the dir-item insertion code - specifically something like insert_with_overflow() which will stuff multiple items under one key. I tend to prefer the idea of simply including a generation in the key offset however since it maintains the 1:1 relationship of keys to names which turns out to be much nicer to code for in my honest opinion. Also none of the code which iterates the tree looking for refs would have to change as the only difference is in the key offset and not in the actual item structure. Testing wise, the patches are in an intermediate state. I''ve debugged a fair bit but I''m certain there''s gremlins lurking in there. The basic namespace operations work well enough (link, unlink, etc). I''ve done light testing of my changes in backref.c by exercising BTRFS_IOC_INO_PATHS. The changes in tree-log.c need the most review and testing - I haven''t really figured out a great way to exercise the code in tree-log yet (suggestions would be great!). Finally, these patches are based off Linux v3.3. --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This patch adds basic support for extended inode refs. This includes support for link and unlink of the refs, which basically gets us support for rename as well. Inode creation does not need changing - extended refs are only added after the ref array is full. Signed-off-by: Mark Fasheh <mfasheh@suse.de> --- fs/btrfs/ctree.h | 50 +++++++++-- fs/btrfs/inode-item.c | 244 +++++++++++++++++++++++++++++++++++++++++++++++-- fs/btrfs/inode.c | 20 ++-- 3 files changed, 288 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 80b6486..5fc77ee 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -143,6 +143,13 @@ struct btrfs_ordered_sum; */ #define BTRFS_NAME_LEN 255 +/* + * Theoretical limit is larger, but we keep this down to a sane + * value. That should limit greatly the possibility of collisions on + * inode ref items. + */ +#define BTRFS_LINK_MAX 65535U + /* 32 bytes in various csum fields */ #define BTRFS_CSUM_SIZE 32 @@ -462,13 +469,16 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (1ULL << 2) #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO (1ULL << 3) +#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) + #define BTRFS_FEATURE_COMPAT_SUPP 0ULL #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL #define BTRFS_FEATURE_INCOMPAT_SUPP \ (BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \ BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL | \ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ - BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO) + BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \ + BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) /* * A leaf is full of items. offset and size tell us where to find @@ -615,6 +625,14 @@ struct btrfs_inode_ref { /* name goes here */ } __attribute__ ((__packed__)); +struct btrfs_inode_extref { + __le64 parent_objectid; + __le64 index; + __le16 name_len; + __u8 name[0]; + /* name goes here */ +} __attribute__ ((__packed__)); + struct btrfs_timespec { __le64 sec; __le32 nsec; @@ -1400,6 +1418,7 @@ struct btrfs_ioctl_defrag_range_args { */ #define BTRFS_INODE_ITEM_KEY 1 #define BTRFS_INODE_REF_KEY 12 +#define BTRFS_INODE_EXTREF_KEY 13 #define BTRFS_XATTR_ITEM_KEY 24 #define BTRFS_ORPHAN_ITEM_KEY 48 /* reserve 2-15 close to the inode for later flexibility */ @@ -1701,6 +1720,13 @@ BTRFS_SETGET_STACK_FUNCS(block_group_flags, BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16); BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64); +/* struct btrfs_inode_extref */ +BTRFS_SETGET_FUNCS(inode_extref_parent, struct btrfs_inode_extref, + parent_objectid, 64); +BTRFS_SETGET_FUNCS(inode_extref_name_len, struct btrfs_inode_extref, + name_len, 16); +BTRFS_SETGET_FUNCS(inode_extref_index, struct btrfs_inode_extref, index, 64); + /* struct btrfs_inode_item */ BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 64); BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64); @@ -2791,12 +2817,12 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, const char *name, int name_len, u64 inode_objectid, u64 ref_objectid, u64 *index); -struct btrfs_inode_ref * -btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct btrfs_path *path, - const char *name, int name_len, - u64 inode_objectid, u64 ref_objectid, int mod); +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_path *path, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, int mod, + u64 *ret_index); int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, u64 objectid); @@ -2804,6 +2830,16 @@ int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, struct btrfs_key *location, int mod); +struct btrfs_inode_extref * +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_path *path, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, int ins_len, + int cow); + +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len); + /* file-item.c */ int btrfs_del_csums(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 len); diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c index baa74f3..cc5e854 100644 --- a/fs/btrfs/inode-item.c +++ b/fs/btrfs/inode-item.c @@ -16,6 +16,7 @@ * Boston, MA 021110-1307, USA. */ +#include <linux/crc32c.h> #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -49,18 +50,56 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name, return 0; } -struct btrfs_inode_ref * +static int find_name_in_ext_backref(struct btrfs_path *path, const char *name, + int name_len, struct btrfs_inode_extref **ref_ret) +{ + struct extent_buffer *leaf; + struct btrfs_inode_extref *ref; + unsigned long ptr; + unsigned long name_ptr; + u32 item_size; + + leaf = path->nodes[0]; + item_size = btrfs_item_size_nr(leaf, path->slots[0]); + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); + ref = (struct btrfs_inode_extref *)ptr; + + /* + * There isn''t actually anything to "search" in an extended + * backref - one name is directly attached to each + * item. Instead what we''re really doing here is some sort of + * sanity check - does the name exist where it should have + * been found. If all is well, we''ll return success and the + * inode ref object. + */ + name_ptr = (unsigned long)(&ref->name); + if (btrfs_inode_extref_name_len(leaf, ref) == name_len + && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) { + *ref_ret = ref; + return 1; + } + return 0; +} + +/* + * Figure the key offset of an extended inode ref + */ +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len) +{ + return (u64) crc32c(parent_objectid, name, name_len); +} + +static struct btrfs_inode_ref * btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct btrfs_path *path, - const char *name, int name_len, - u64 inode_objectid, u64 ref_objectid, int mod) + struct btrfs_root *root, + struct btrfs_path *path, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, int ins_len, + int cow) { + int ret; struct btrfs_key key; struct btrfs_inode_ref *ref; - int ins_len = mod < 0 ? -1 : 0; - int cow = mod != 0; - int ret; key.objectid = inode_objectid; key.type = BTRFS_INODE_REF_KEY; @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, return ref; } -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, +struct btrfs_inode_extref * +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_path *path, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, int ins_len, + int cow) +{ + int ret; + struct btrfs_key key; + struct btrfs_inode_extref *ref; + + key.objectid = inode_objectid; + key.type = BTRFS_INODE_EXTREF_KEY; + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); + + ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow); + if (ret < 0) + return ERR_PTR(ret); + if (ret > 0) + return NULL; + if (!find_name_in_ext_backref(path, name, name_len, &ref)) + return NULL; + return ref; +} + +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_path *path, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, int mod, + u64 *ret_index) +{ + struct btrfs_inode_ref *ref1; + struct btrfs_inode_extref *ref2; + int ins_len = mod < 0 ? -1 : 0; + int cow = mod != 0; + + ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len, + inode_objectid, ref_objectid, ins_len, + cow); + if (IS_ERR(ref1)) + return PTR_ERR(ref1); + + if (ref1 != NULL) { + *ret_index = btrfs_inode_ref_index(path->nodes[0], ref1); + return 0; + } + + btrfs_release_path(path); + + ref2 = btrfs_lookup_inode_extref(trans, root, path, name, + name_len, inode_objectid, + ref_objectid, ins_len, cow); + if (IS_ERR(ref2)) + return PTR_ERR(ref2); + + if (ref2) { + *ret_index = btrfs_inode_extref_index(path->nodes[0], ref2); + return 0; + } + + return -ENOENT; +} + +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans, struct btrfs_root *root, const char *name, int name_len, u64 inode_objectid, u64 ref_objectid, u64 *index) { struct btrfs_path *path; struct btrfs_key key; + struct btrfs_inode_extref *ref2; + struct extent_buffer *leaf; + int ret; + + key.objectid = inode_objectid; + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + path->leave_spinning = 1; + + ret = btrfs_search_slot(trans, root, &key, path, -1, 1); + if (ret > 0) { + ret = -ENOENT; + goto out; + } else if (ret < 0) { + goto out; + } + + /* + * Sanity check - did we find the right item for this name? + * This should always succeed so error here will make the FS + * readonly. + */ + if (!find_name_in_ext_backref(path, name, name_len, &ref2)) { + btrfs_std_error(root->fs_info, -ENOENT); + ret = -EROFS; + goto out; + } + + leaf = path->nodes[0]; + if (index) + *index = btrfs_inode_extref_index(leaf, ref2); + + ret = btrfs_del_item(trans, root, path); + +out: + btrfs_free_path(path); + + return ret; +} + +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, u64 *index) +{ + struct btrfs_path *path; + struct btrfs_key key; struct btrfs_inode_ref *ref; struct extent_buffer *leaf; unsigned long ptr; unsigned long item_start; u32 item_size; u32 sub_item_len; - int ret; + int ret, search_ext_refs = 0; int del_len = name_len + sizeof(*ref); key.objectid = inode_objectid; @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, ret = btrfs_search_slot(trans, root, &key, path, -1, 1); if (ret > 0) { ret = -ENOENT; + search_ext_refs = 1; goto out; } else if (ret < 0) { goto out; } if (!find_name_in_backref(path, name, name_len, &ref)) { ret = -ENOENT; + search_ext_refs = 1; goto out; } leaf = path->nodes[0]; @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, item_size - sub_item_len, 1); out: btrfs_free_path(path); + + if (search_ext_refs) { + /* + * No refs were found, or we could not find the + * name in our ref array. Find and remove the extended + * inode ref then. + */ + return btrfs_del_inode_extref(trans, root, name, name_len, + inode_objectid, ref_objectid, index); + } + + return ret; +} + +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + const char *name, int name_len, + u64 inode_objectid, u64 ref_objectid, u64 index) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_inode_extref *ref; + unsigned long ptr; + int ret; + int ins_len = name_len + sizeof(*ref); + + key.objectid = inode_objectid; + key.type = BTRFS_INODE_EXTREF_KEY; + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + path->leave_spinning = 1; + ret = btrfs_insert_empty_item(trans, root, path, &key, + ins_len); + if (ret < 0) + goto out; + + ref = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_inode_extref); + + btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len); + btrfs_set_inode_extref_index(path->nodes[0], ref, index); + btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid); + + ptr = (unsigned long)&ref->name; + write_extent_buffer(path->nodes[0], name, ptr, name_len); + btrfs_mark_buffer_dirty(path->nodes[0]); + +out: + btrfs_free_path(path); return ret; } @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans, out: btrfs_free_path(path); + + if (ret == -EMLINK) { + struct btrfs_super_block *disk_super = root->fs_info->super_copy; + /* We ran out of space in the ref array. Need to + * add an extended ref. */ + if (btrfs_super_incompat_flags(disk_super) + & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) + ret = btrfs_insert_inode_extref(trans, root, name, + name_len, + inode_objectid, + ref_objectid, index); + } + return ret; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 892b347..0ca525d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(dir)->root; struct btrfs_path *path; - struct btrfs_inode_ref *ref; struct btrfs_dir_item *di; struct inode *inode = dentry->d_inode; u64 index; @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, } btrfs_release_path(path); - ref = btrfs_lookup_inode_ref(trans, root, path, - dentry->d_name.name, dentry->d_name.len, - ino, dir_ino, 0); - if (IS_ERR(ref)) { - err = PTR_ERR(ref); + ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name, + dentry->d_name.len, ino, dir_ino, 0, &index); + if (ret) { + err = ret; goto out; } - BUG_ON(!ref); + if (check_path_shared(root, path)) goto out; - index = btrfs_inode_ref_index(path->nodes[0], ref); + btrfs_release_path(path); /* @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY); key[0].offset = 0; + /* + * Start new inodes with a V1 ref. This is slightly more + * efficient for small numbers of hard links. + */ key[1].objectid = objectid; btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY); key[1].offset = ref_objectid; @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, if (root->objectid != BTRFS_I(inode)->root->objectid) return -EXDEV; - if (inode->i_nlink == ~0U) + if (inode->i_nlink >= BTRFS_LINK_MAX) return -EMLINK; err = btrfs_set_inode_index(dir, &index); -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Teach tree-log.c about extended inode refs. In particular, we have to adjust the behavior of inode ref replay as well as log tree recovery to account for the existence of extended refs. Signed-off-by: Mark Fasheh <mfasheh@suse.de> --- fs/btrfs/tree-log.c | 320 +++++++++++++++++++++++++++++++++++++++++--------- fs/btrfs/tree-log.h | 4 + 2 files changed, 266 insertions(+), 58 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 966cc74..d69b07a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -23,6 +23,7 @@ #include "disk-io.h" #include "locking.h" #include "print-tree.h" +#include "backref.h" #include "compat.h" #include "tree-log.h" @@ -748,6 +749,7 @@ static noinline int backref_in_log(struct btrfs_root *log, { struct btrfs_path *path; struct btrfs_inode_ref *ref; + struct btrfs_inode_extref *extref; unsigned long ptr; unsigned long ptr_end; unsigned long name_ptr; @@ -764,8 +766,24 @@ static noinline int backref_in_log(struct btrfs_root *log, if (ret != 0) goto out; - item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]); ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]); + + if (key->type == BTRFS_INODE_EXTREF_KEY) { + extref = (struct btrfs_inode_extref *)ptr; + + found_name_len = btrfs_inode_extref_name_len(path->nodes[0], + extref); + if (found_name_len == namelen) { + name_ptr = (unsigned long)&extref->name; + ret = memcmp_extent_buffer(path->nodes[0], name, + name_ptr, namelen); + if (ret == 0) + match = 1; + } + goto out; + } + + item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]); ptr_end = ptr + item_size; while (ptr < ptr_end) { ref = (struct btrfs_inode_ref *)ptr; @@ -786,7 +804,6 @@ out: return match; } - /* * replay one inode back reference item found in the log tree. * eb, slot and key refer to the buffer and key found in the log tree. @@ -801,15 +818,20 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, struct btrfs_key *key) { struct btrfs_inode_ref *ref; + struct btrfs_inode_extref *extref; struct btrfs_dir_item *di; + struct btrfs_key search_key; struct inode *dir; struct inode *inode; unsigned long ref_ptr; unsigned long ref_end; - char *name; - int namelen; + char *name, *victim_name; + int namelen, victim_name_len; int ret; int search_done = 0; + int log_ref_ver = 0; + u64 parent_objectid, inode_objectid, ref_index; + struct extent_buffer *leaf; /* * it is possible that we didn''t log all the parent directories @@ -817,32 +839,56 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, * copy the back ref in. The link count fixup code will take * care of the rest */ - dir = read_one_inode(root, key->offset); + + if (key->type == BTRFS_INODE_EXTREF_KEY) { + log_ref_ver = 1; + + ref_ptr = btrfs_item_ptr_offset(eb, slot); + + /* So that we don''t loop back looking for old style log refs. */ + ref_end = ref_ptr; + + extref = (struct btrfs_inode_extref *) btrfs_item_ptr_offset(eb, slot); + namelen = btrfs_inode_extref_name_len(eb, extref); + name = kmalloc(namelen, GFP_NOFS); + + read_extent_buffer(eb, name, (unsigned long)&extref->name, + namelen); + + ref_index = btrfs_inode_extref_index(eb, extref); + parent_objectid = btrfs_inode_extref_parent(eb, extref); + } else { + parent_objectid = key->offset; + + ref_ptr = btrfs_item_ptr_offset(eb, slot); + ref_end = ref_ptr + btrfs_item_size_nr(eb, slot); + + ref = (struct btrfs_inode_ref *)ref_ptr; + namelen = btrfs_inode_ref_name_len(eb, ref); + name = kmalloc(namelen, GFP_NOFS); + BUG_ON(!name); + + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); + + ref_index = btrfs_inode_ref_index(eb, ref); + } + + inode_objectid = key->objectid; + + dir = read_one_inode(root, parent_objectid); if (!dir) return -ENOENT; - inode = read_one_inode(root, key->objectid); + inode = read_one_inode(root, inode_objectid); if (!inode) { iput(dir); return -EIO; } - ref_ptr = btrfs_item_ptr_offset(eb, slot); - ref_end = ref_ptr + btrfs_item_size_nr(eb, slot); - again: - ref = (struct btrfs_inode_ref *)ref_ptr; - - namelen = btrfs_inode_ref_name_len(eb, ref); - name = kmalloc(namelen, GFP_NOFS); - BUG_ON(!name); - - read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); - /* if we already have a perfect match, we''re done */ if (inode_in_dir(root, path, btrfs_ino(dir), btrfs_ino(inode), - btrfs_inode_ref_index(eb, ref), - name, namelen)) { + ref_index, name, namelen)) { goto out; } @@ -857,19 +903,23 @@ again: if (search_done) goto insert; - ret = btrfs_search_slot(NULL, root, key, path, 0, 0); + /* Search old style refs */ + search_key.objectid = inode_objectid; + search_key.type = BTRFS_INODE_REF_KEY; + search_key.offset = parent_objectid; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); if (ret == 0) { - char *victim_name; - int victim_name_len; struct btrfs_inode_ref *victim_ref; unsigned long ptr; unsigned long ptr_end; - struct extent_buffer *leaf = path->nodes[0]; + + leaf = path->nodes[0]; /* are we trying to overwrite a back ref for the root directory * if so, just jump out, we''re done */ - if (key->objectid == key->offset) + if (search_key.objectid == search_key.offset) goto out_nowrite; /* check all the names in this back reference to see @@ -889,7 +939,7 @@ again: (unsigned long)(victim_ref + 1), victim_name_len); - if (!backref_in_log(log, key, victim_name, + if (!backref_in_log(log, &search_key, victim_name, victim_name_len)) { btrfs_inc_nlink(inode); btrfs_release_path(path); @@ -902,19 +952,44 @@ again: ptr = (unsigned long)(victim_ref + 1) + victim_name_len; } BUG_ON(ret); - - /* - * NOTE: we have searched root tree and checked the - * coresponding ref, it does not need to check again. - */ - search_done = 1; } btrfs_release_path(path); + /* Same search but for extended refs */ + extref = btrfs_lookup_inode_extref(NULL, root, path, name, namelen, + inode_objectid, parent_objectid, 0, + 0); + if (extref && !IS_ERR(extref)) { + victim_name_len = btrfs_inode_extref_name_len(eb, extref); + victim_name = kmalloc(namelen, GFP_NOFS); + leaf = path->nodes[0]; + read_extent_buffer(eb, name, (unsigned long)&extref->name, namelen); + + search_key.objectid = inode_objectid; + search_key.type = BTRFS_INODE_EXTREF_KEY; + search_key.offset = btrfs_extref_key_off(parent_objectid, name, namelen); + if (!backref_in_log(log, &search_key, victim_name, + victim_name_len)) { + btrfs_inc_nlink(inode); + btrfs_release_path(path); + + ret = btrfs_unlink_inode(trans, root, dir, + inode, victim_name, + victim_name_len); + } + kfree(victim_name); + BUG_ON(ret); + } + + /* + * NOTE: we have searched root tree and checked the + * coresponding refs, it does not need to be checked again. + */ + search_done = 1; + /* look for a conflicting sequence number */ di = btrfs_lookup_dir_index_item(trans, root, path, btrfs_ino(dir), - btrfs_inode_ref_index(eb, ref), - name, namelen, 0); + ref_index, name, namelen, 0); if (di && !IS_ERR(di)) { ret = drop_one_dir_item(trans, root, path, dir, di); BUG_ON(ret); @@ -932,17 +1007,25 @@ again: insert: /* insert our name */ - ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, - btrfs_inode_ref_index(eb, ref)); + ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, ref_index); BUG_ON(ret); btrfs_update_inode(trans, root, inode); out: - ref_ptr = (unsigned long)(ref + 1) + namelen; + ref_ptr = (unsigned long)(ref_ptr + sizeof(struct btrfs_inode_ref)) + namelen; kfree(name); - if (ref_ptr < ref_end) + if (ref_ptr < ref_end) { + ref = (struct btrfs_inode_ref *)ref_ptr; + namelen = btrfs_inode_ref_name_len(eb, ref); + name = kmalloc(namelen, GFP_NOFS); + BUG_ON(!name); + + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); + + ref_index = btrfs_inode_ref_index(eb, ref); goto again; + } /* finally write the back reference in the inode */ ret = overwrite_item(trans, root, path, eb, slot, key); @@ -965,25 +1048,103 @@ static int insert_orphan_item(struct btrfs_trans_handle *trans, return ret; } +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, + u64 start_off, struct btrfs_path *path, + struct btrfs_inode_extref **ret_ref, u64 *found_off) +{ + int ret, slot; + struct btrfs_key key, found_key; + struct btrfs_inode_extref *ref; + struct extent_buffer *leaf; + struct btrfs_item *item; + unsigned long ptr; -/* - * There are a few corners where the link count of the file can''t - * be properly maintained during replay. So, instead of adding - * lots of complexity to the log code, we just scan the backrefs - * for any file that has been through replay. - * - * The scan will update the link count on the inode to reflect the - * number of back refs found. If it goes down to zero, the iput - * will free the inode. - */ -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct inode *inode) + key.objectid = inode_objectid; + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); + key.offset = start_off; + + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + goto out; + + while (1) { + leaf = path->nodes[0]; + slot = path->slots[0]; + if (slot >= btrfs_header_nritems(leaf)) { + /* + * If the item at offset is not found, + * btrfs_search_slot will point us to the slot + * where it should be inserted. In our case + * that will be the slot directly before the + * next INODE_REF_KEY_V2 item. In the case + * that we''re pointing to the last slot in a + * leaf, we must move one leaf over. + */ + ret = btrfs_next_leaf(root, path); + if (ret) { + if (ret >= 1) + ret = -ENOENT; + break; + } + continue; + } + + item = btrfs_item_nr(leaf, slot); + btrfs_item_key_to_cpu(leaf, &found_key, slot); + + /* + * Check that we''re still looking at an extended ref key for + * this particular objectid. If we have different + * objectid or type then there are no more to be found + * in the tree and we can exit. + */ + ret = -ENOENT; + if (found_key.objectid != inode_objectid) + break; + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY) + break; + + ret = 0; + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); + ref = (struct btrfs_inode_extref *)ptr; + *ret_ref = ref; + if (found_off) + *found_off = found_key.offset + 1; + break; + } + +out: + return ret; +} + +static int count_inode_extrefs(struct btrfs_root *root, + struct inode *inode, struct btrfs_path *path) +{ + int ret; + unsigned int nlink = 0; + u64 inode_objectid = btrfs_ino(inode); + u64 offset = 0; + struct btrfs_inode_extref *ref; + + while (1) { + ret = btrfs_find_one_extref(root, inode_objectid, offset, path, + &ref, &offset); + if (ret) + break; + + nlink++; + offset++; + } + + return nlink; +} + +static int count_inode_refs(struct btrfs_root *root, + struct inode *inode, struct btrfs_path *path) { - struct btrfs_path *path; int ret; struct btrfs_key key; - u64 nlink = 0; + unsigned int nlink = 0; unsigned long ptr; unsigned long ptr_end; int name_len; @@ -993,10 +1154,6 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, key.type = BTRFS_INODE_REF_KEY; key.offset = (u64)-1; - path = btrfs_alloc_path(); - if (!path) - return -ENOMEM; - while (1) { ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) @@ -1030,6 +1187,48 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, btrfs_release_path(path); } btrfs_release_path(path); + btrfs_free_path(path); + + return nlink; +} + +/* + * There are a few corners where the link count of the file can''t + * be properly maintained during replay. So, instead of adding + * lots of complexity to the log code, we just scan the backrefs + * for any file that has been through replay. + * + * The scan will update the link count on the inode to reflect the + * number of back refs found. If it goes down to zero, the iput + * will free the inode. + */ +static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct inode *inode) +{ + struct btrfs_path *path; + int ret; + u64 nlink = 0; + u64 ino = btrfs_ino(inode); + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + ret = count_inode_refs(root, inode, path); + btrfs_release_path(path); + if (ret < 0) + goto out; + + nlink = ret; + + ret = count_inode_extrefs(root, inode, path); + btrfs_release_path(path); + if (ret < 0) + goto out; + + nlink += ret; + if (nlink != inode->i_nlink) { set_nlink(inode, nlink); btrfs_update_inode(trans, root, inode); @@ -1045,9 +1244,10 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, ret = insert_orphan_item(trans, root, ino); BUG_ON(ret); } - btrfs_free_path(path); - return 0; +out: + btrfs_free_path(path); + return ret; } static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans, @@ -1689,6 +1889,10 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb, ret = add_inode_ref(wc->trans, root, log, path, eb, i, &key); BUG_ON(ret && ret != -ENOENT); + } else if (key.type == BTRFS_INODE_EXTREF_KEY) { + ret = add_inode_ref(wc->trans, root, log, path, + eb, i, &key); + BUG_ON(ret && ret != -ENOENT); } else if (key.type == BTRFS_EXTENT_DATA_KEY) { ret = replay_one_extent(wc->trans, root, path, eb, i, &key); diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h index 2270ac5..fd40ad5 100644 --- a/fs/btrfs/tree-log.h +++ b/fs/btrfs/tree-log.h @@ -49,4 +49,8 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans, int btrfs_log_new_name(struct btrfs_trans_handle *trans, struct inode *inode, struct inode *old_dir, struct dentry *parent); +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, + u64 start_off, struct btrfs_path *path, + struct btrfs_inode_extref **ret_ref, u64 *found_off); + #endif -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The iterate_irefs in backref.c is used to build path components from inode refs. I had to add a 2nd iterate function callback to handle extended refs. Both iterate callbacks eventually converge upon iref_to_path() which I was able to keep as one function with some small code to abstract away differences in the two disk structures. Signed-off-by: Mark Fasheh <mfasheh@suse.de> --- fs/btrfs/backref.c | 200 ++++++++++++++++++++++++++++++++++++++++++---------- fs/btrfs/backref.h | 4 +- 2 files changed, 165 insertions(+), 39 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 0436c12..f2b8952 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -22,6 +22,7 @@ #include "ulist.h" #include "transaction.h" #include "delayed-ref.h" +#include "tree-log.h" /* * this structure records all encountered refs on the way up to the root @@ -858,62 +859,75 @@ static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root, } /* - * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements - * of the path are separated by ''/'' and the path is guaranteed to be - * 0-terminated. the path is only given within the current file system. - * Therefore, it never starts with a ''/''. the caller is responsible to provide - * "size" bytes in "dest". the dest buffer will be filled backwards. finally, - * the start point of the resulting string is returned. this pointer is within - * dest, normally. - * in case the path buffer would overflow, the pointer is decremented further - * as if output was written to the buffer, though no more output is actually - * generated. that way, the caller can determine how much space would be - * required for the path to fit into the buffer. in that case, the returned - * value will be smaller than dest. callers must check this! + * Given the parent objectid and name/name_len pairs of an inode ref + * (any version) this iterates to turn that information into a + * full filesystem path. elements of the path are separated by ''/'' and + * the path is guaranteed to be 0-terminated. the path is only given + * within the current file system. Therefore, it never starts with a + * ''/''. the caller is responsible to provide "size" bytes in + * "dest". the dest buffer will be filled backwards. finally, the + * start point of the resulting string is returned. this pointer is + * within dest, normally. in case the path buffer would overflow, the + * pointer is decremented further as if output was written to the + * buffer, though no more output is actually generated. that way, the + * caller can determine how much space would be required for the path + * to fit into the buffer. in that case, the returned value will be + * smaller than dest. callers must check this! */ static char *iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, - struct btrfs_inode_ref *iref, - struct extent_buffer *eb_in, u64 parent, - char *dest, u32 size) + int name_len, unsigned long name_off, + struct extent_buffer *eb_in, u64 parent, + char *dest, u32 size) { - u32 len; int slot; u64 next_inum; int ret; s64 bytes_left = size - 1; struct extent_buffer *eb = eb_in; struct btrfs_key found_key; + struct btrfs_inode_ref *iref; + struct btrfs_inode_extref *iref2; if (bytes_left >= 0) dest[bytes_left] = ''\0''; while (1) { - len = btrfs_inode_ref_name_len(eb, iref); - bytes_left -= len; + bytes_left -= name_len; if (bytes_left >= 0) read_extent_buffer(eb, dest + bytes_left, - (unsigned long)(iref + 1), len); + name_off, name_len); if (eb != eb_in) free_extent_buffer(eb); + + /* Ok, we have enough to find any refs to the parent inode. */ ret = inode_ref_info(parent, 0, fs_root, path, &found_key); - if (ret > 0) - ret = -ENOENT; - if (ret) - break; next_inum = found_key.offset; + if (ret == 0) { + slot = path->slots[0]; + eb = path->nodes[0]; + /* make sure we can use eb after releasing the path */ + if (eb != eb_in) + atomic_inc(&eb->refs); + btrfs_release_path(path); + iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); + + name_len = btrfs_inode_ref_name_len(eb, iref); + name_off = (unsigned long)(iref + 1); + } else { + ret = btrfs_find_one_extref(fs_root, parent, 0, path, + &iref2, NULL); + if (ret) + break; + + next_inum = btrfs_inode_extref_parent(eb, iref2); + name_off = (unsigned long)&iref2->name; + name_len = btrfs_inode_extref_name_len(eb, iref2); + } /* regular exit ahead */ if (parent == next_inum) break; - slot = path->slots[0]; - eb = path->nodes[0]; - /* make sure we can use eb after releasing the path */ - if (eb != eb_in) - atomic_inc(&eb->refs); - btrfs_release_path(path); - - iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); parent = next_inum; --bytes_left; if (bytes_left >= 0) @@ -1226,9 +1240,9 @@ int iterate_inodes_from_logical(u64 logical, struct btrfs_fs_info *fs_info, return ret; } -static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, - struct btrfs_path *path, - iterate_irefs_t *iterate, void *ctx) +static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root, + struct btrfs_path *path, + iterate_irefs_t *iterate, void *ctx) { int ret; int slot; @@ -1244,7 +1258,7 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, while (1) { ret = inode_ref_info(inum, parent ? parent+1 : 0, fs_root, path, - &found_key); + &found_key); if (ret < 0) break; if (ret) { @@ -1286,6 +1300,76 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, return ret; } +static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root, + struct btrfs_path *path, + iterate_extrefs_t *iterate, void *ctx) +{ + int ret; + int slot; + u64 offset = 0; + u64 parent; + int found = 0; + struct extent_buffer *eb; + struct btrfs_item *item; + struct btrfs_inode_extref *iref2; + + while (1) { + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, + &offset); + if (ret < 0) + break; + if (ret) { + ret = found ? 0 : -ENOENT; + break; + } + ++found; + + slot = path->slots[0]; + eb = path->nodes[0]; + /* make sure we can use eb after releasing the path */ + atomic_inc(&eb->refs); + btrfs_release_path(path); + + item = btrfs_item_nr(eb, slot); + iref2 = btrfs_item_ptr(eb, slot, struct btrfs_inode_extref); + + parent = btrfs_inode_extref_parent(eb, iref2); + ret = iterate(parent, iref2, eb, ctx); + if (ret) { + free_extent_buffer(eb); + break; + } + + free_extent_buffer(eb); + offset++; + } + + btrfs_release_path(path); + + return ret; +} + +static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, + struct btrfs_path *path, + iterate_irefs_t *iterate, + iterate_extrefs_t *iterate2, void *ctx) +{ + int ret, found_refs = 0; + + ret = iterate_inode_refs(inum, fs_root, path, iterate, ctx); + if (ret && ret != -ENOENT) + return ret; + + if (ret != -ENOENT) + ++found_refs; + + ret = iterate_inode_extrefs(inum, fs_root, path, iterate2, ctx); + if (ret == -ENOENT && found_refs) + return 0; + + return ret; +} + /* * returns 0 if the path could be dumped (probably truncated) * returns <0 in case of an error @@ -1299,13 +1383,53 @@ static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, int i = ipath->fspath->elem_cnt; const int s_ptr = sizeof(char *); u32 bytes_left; + u32 name_len = btrfs_inode_ref_name_len(eb, iref); + + bytes_left = ipath->fspath->bytes_left > s_ptr ? + ipath->fspath->bytes_left - s_ptr : 0; + + fspath_min = (char *)ipath->fspath->val + (i + 1) * s_ptr; + fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, name_len, + (unsigned long)(iref + 1), eb, inum, fspath_min, + bytes_left); + if (IS_ERR(fspath)) + return PTR_ERR(fspath); + + if (fspath > fspath_min) { + ipath->fspath->val[i] = (u64)(unsigned long)fspath; + ++ipath->fspath->elem_cnt; + ipath->fspath->bytes_left = fspath - fspath_min; + } else { + ++ipath->fspath->elem_missed; + ipath->fspath->bytes_missing += fspath_min - fspath; + ipath->fspath->bytes_left = 0; + } + + return 0; +} + +/* + * returns 0 if the path could be dumped (probably truncated) + * returns <0 in case of an error + */ +static int inode_to_path2(u64 inum, struct btrfs_inode_extref *iref2, + struct extent_buffer *eb, void *ctx) +{ + struct inode_fs_paths *ipath = ctx; + char *fspath; + char *fspath_min; + int i = ipath->fspath->elem_cnt; + const int s_ptr = sizeof(char *); + u32 bytes_left; + u32 name_len = btrfs_inode_extref_name_len(eb, iref2); bytes_left = ipath->fspath->bytes_left > s_ptr ? ipath->fspath->bytes_left - s_ptr : 0; fspath_min = (char *)ipath->fspath->val + (i + 1) * s_ptr; - fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, iref, eb, - inum, fspath_min, bytes_left); + fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, name_len, + (unsigned long)&iref2->name, eb, inum, + fspath_min, bytes_left); if (IS_ERR(fspath)) return PTR_ERR(fspath); @@ -1339,7 +1463,7 @@ static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, int paths_from_inode(u64 inum, struct inode_fs_paths *ipath) { return iterate_irefs(inum, ipath->fs_root, ipath->btrfs_path, - inode_to_path, ipath); + inode_to_path, inode_to_path2, ipath); } /* diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h index d00dfa9..4634ed7 100644 --- a/fs/btrfs/backref.h +++ b/fs/btrfs/backref.h @@ -31,7 +31,9 @@ struct inode_fs_paths { typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root, void *ctx); typedef int (iterate_irefs_t)(u64 parent, struct btrfs_inode_ref *iref, - struct extent_buffer *eb, void *ctx); + struct extent_buffer *eb, void *ctx); +typedef int (iterate_extrefs_t)(u64 parent, struct btrfs_inode_extref *iref, + struct extent_buffer *eb, void *ctx); int inode_item_info(u64 inum, u64 ioff, struct btrfs_root *fs_root, struct btrfs_path *path); -- 1.7.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/05/2012 04:09 PM, Mark Fasheh wrote:> Currently btrfs has a limitation on the maximum number of hard > links an inode can have. Specifically, links are stored in an array > of ref items: > > struct btrfs_inode_ref { __le64 index; __le16 name_len; /* name > goes here */ } __attribute__ ((__packed__)); > > The ref arrays are found via key triple: > > (inode objectid, BTRFS_INODE_EXTREF_KEY, parent dir objectid) > > Since items can not exceed the size of a leaf, the total number of > links that can be stored for a given inode / parent dir pair is > limited to under 4k. This works fine for the most common case of > few to only a handful of links. Once the link count gets higher > however, we begin to return EMLINK. > > > The following patches fix this situation by introducing a new ref > item: > > struct btrfs_inode_extref { __le64 parent_objectid; __le64 index; > __le16 name_len; __u8 name[0]; /* name goes here */ } > __attribute__ ((__packed__)); > > Extended refs behave differently from ref arrays in several key > areas.Thanks for digging into this. It''s been heating up on the list lately.> Each extended refs is it''s own item so there is no ref array (and > therefore no limit on size). > > As a result, we must use a different addressing scheme. Extended > ref keys look like: > > (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) > > Where hash is defined as a function of the parent objectid and link > name.I think this is effective. It will essentially have the same properties as a dirent but seeds the hash at objectid instead of ~1.> This effectively fixes the limitation, though we have a slightly > less efficient packing of link data. To keep the best of both > worlds then, I implemented the following behavior: > > Extended refs don''t replace the existing ref array. An inode gets > an extended ref for a given link _only_ after the ref array has > been filled. So the most common cases shouldn''t actually see any > difference in performance or disk usage as they''ll never get to the > point where we''re using an extended ref. > > It''s important while reading the patches however that there''s still > the possibility that we can have a set of operations that grow out > an inode ref array (adding some extended refs) and then remove only > the refs in the array. I don''t really see this being common but > it''s a case we always have to consider when coding these changes. > > Right now there is a limitation for extrefs in that we''re not > handling the possibility of a hash collision. There are two ways I > see we can deal with this: > > We can use a 56-bit hash and keep a generation counter in the lower > 8 bits of the offset field. The cost would be an additional tree > search (between offset <hash>00 and <hash>FF) if we don''t find > exactly the name we were looking for. > > An alternative solution to dealing with collisions could be to > emulate the dir-item insertion code - specifically something like > insert_with_overflow() which will stuff multiple items under one > key. I tend to prefer the idea ofI vote for this option. The code for insert_with_overflow is already well tested and anything that will generate a collision in dirent insertion will generate a collision for the backref insertion. The dirent structure is larger than the extref structure, so there should always be space to match the dirent leaf so that a failure occurs there first. - -Jeff - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJPfgr5AAoJEB57S2MheeWyt6cQAJPHUlCtB3+huJTodA7ow+jy 3WPhrbTPYME6lLpC/JQH8XbogKL1IqLbsvl9M3KzHMZAJ4fRzNJXmMCFgIou4cgu v2cnNwkU1r5LJF/M3HMk1nhxABCeSONNTFqDbEp/eiTbI7X/UsM6q0vdPpj0vYih 20kWZhazmgx4pUPrtldKU+k91jjsZRoZyn8Bx6lEYPKIx1RQuBDPDH8q3ep5og2d OQHDfMVNEJJ9Mz9lv+BZDqx/Q2Om8wyaM5GfjhtSocT+XrpTT4tC8FnKiUuOE1Ej dy1Td43t+cCWvglGRDFj5I6ObfW7x4aUDTizo1hMUuGEQFQVc3vGY3OtDuqPxwoV XS/4XRR3GU2cmsyjTky4Twa+81RF84cUl05+gM9VMaD7BJX60kIp1SXnE/TpyVUd oVgZsgkV75R4XmjBp1zDrDRHTzpsgxm+zaVhSTxOKWhnV1bWLZ36SXw/1LRr0IVP K+0xbnds7WWoRJNMSpJTBgAr7N4A5QEYvIUSQEO4UpgX4EXHYkfWstvvYNx1vUOT HiHtS3ZKbrfzcmD1ZNSOPAcxold5BsVisuBwoyJnvOyCFeCi+Q+S0W8wA0kI6iS8 57nWYZfGelKLm/ATL8vPkpj6BCkFfhAK8neNdd80v1SpkUJOZr5tMTxaYO1r3QvC T0e3jAC32pgrJ/2fByT+ =dx+I -----END PGP SIGNATURE----- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/06/2012 04:09 AM, Mark Fasheh wrote:> Currently btrfs has a limitation on the maximum number of hard links an > inode can have. Specifically, links are stored in an array of ref > items: > > struct btrfs_inode_ref { > __le64 index; > __le16 name_len; > /* name goes here */ > } __attribute__ ((__packed__)); > > The ref arrays are found via key triple: > > (inode objectid, BTRFS_INODE_EXTREF_KEY, parent dir objectid) > > Since items can not exceed the size of a leaf, the total number of links > that can be stored for a given inode / parent dir pair is limited to under > 4k. This works fine for the most common case of few to only a handful of > links. Once the link count gets higher however, we begin to return EMLINK. > > > The following patches fix this situation by introducing a new ref item: > > struct btrfs_inode_extref { > __le64 parent_objectid; > __le64 index; > __le16 name_len; > __u8 name[0]; > /* name goes here */ > } __attribute__ ((__packed__)); > > Extended refs behave differently from ref arrays in several key areas. > > Each extended refs is it''s own item so there is no ref array (and > therefore no limit on size). > > As a result, we must use a different addressing scheme. Extended ref keys > look like: > > (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) > > Where hash is defined as a function of the parent objectid and link name. > > This effectively fixes the limitation, though we have a slightly less > efficient packing of link data. To keep the best of both worlds then, I > implemented the following behavior: > > Extended refs don''t replace the existing ref array. An inode gets an > extended ref for a given link _only_ after the ref array has been filled. So > the most common cases shouldn''t actually see any difference in performance > or disk usage as they''ll never get to the point where we''re using an > extended ref. > > It''s important while reading the patches however that there''s still the > possibility that we can have a set of operations that grow out an inode ref > array (adding some extended refs) and then remove only the refs in the > array. I don''t really see this being common but it''s a case we always have > to consider when coding these changes. > > Right now there is a limitation for extrefs in that we''re not handling the > possibility of a hash collision. There are two ways I see we can deal with > this: > > We can use a 56-bit hash and keep a generation counter in the lower 8 > bits of the offset field. The cost would be an additional tree search > (between offset <hash>00 and <hash>FF) if we don''t find exactly the name we > were looking for. > > An alternative solution to dealing with collisions could be to emulate the > dir-item insertion code - specifically something like insert_with_overflow() > which will stuff multiple items under one key. I tend to prefer the idea of > simply including a generation in the key offset however since it maintains > the 1:1 relationship of keys to names which turns out to be much nicer to > code for in my honest opinion. Also none of the code which iterates the tree > looking for refs would have to change as the only difference is in the key > offset and not in the actual item structure. > > > Testing wise, the patches are in an intermediate state. I''ve debugged a fair > bit but I''m certain there''s gremlins lurking in there. The basic namespace > operations work well enough (link, unlink, etc). I''ve done light testing of > my changes in backref.c by exercising BTRFS_IOC_INO_PATHS. The changes in > tree-log.c need the most review and testing - I haven''t really figured out a > great way to exercise the code in tree-log yet (suggestions would be > great!). >For the log recover test, I used to sysrq+b to make sure our log remains on disk. Will also test this patchset sooner or later. thanks, liubo> > Finally, these patches are based off Linux v3.3. > --Mark > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/06/2012 09:24 AM, Liu Bo wrote:> On 04/06/2012 04:09 AM, Mark Fasheh wrote: >> Currently btrfs has a limitation on the maximum number of hard links an >> inode can have. Specifically, links are stored in an array of ref >> items: >> >> struct btrfs_inode_ref { >> __le64 index; >> __le16 name_len; >> /* name goes here */ >> } __attribute__ ((__packed__)); >> >> The ref arrays are found via key triple: >> >> (inode objectid, BTRFS_INODE_EXTREF_KEY, parent dir objectid) >> >> Since items can not exceed the size of a leaf, the total number of links >> that can be stored for a given inode / parent dir pair is limited to under >> 4k. This works fine for the most common case of few to only a handful of >> links. Once the link count gets higher however, we begin to return EMLINK. >> >> >> The following patches fix this situation by introducing a new ref item: >> >> struct btrfs_inode_extref { >> __le64 parent_objectid; >> __le64 index; >> __le16 name_len; >> __u8 name[0]; >> /* name goes here */ >> } __attribute__ ((__packed__)); >> >> Extended refs behave differently from ref arrays in several key areas. >> >> Each extended refs is it''s own item so there is no ref array (and >> therefore no limit on size). >> >> As a result, we must use a different addressing scheme. Extended ref keys >> look like: >> >> (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) >> >> Where hash is defined as a function of the parent objectid and link name. >> >> This effectively fixes the limitation, though we have a slightly less >> efficient packing of link data. To keep the best of both worlds then, I >> implemented the following behavior: >> >> Extended refs don''t replace the existing ref array. An inode gets an >> extended ref for a given link _only_ after the ref array has been filled. So >> the most common cases shouldn''t actually see any difference in performance >> or disk usage as they''ll never get to the point where we''re using an >> extended ref. >> >> It''s important while reading the patches however that there''s still the >> possibility that we can have a set of operations that grow out an inode ref >> array (adding some extended refs) and then remove only the refs in the >> array. I don''t really see this being common but it''s a case we always have >> to consider when coding these changes. >> >> Right now there is a limitation for extrefs in that we''re not handling the >> possibility of a hash collision. There are two ways I see we can deal with >> this: >> >> We can use a 56-bit hash and keep a generation counter in the lower 8 >> bits of the offset field. The cost would be an additional tree search >> (between offset <hash>00 and <hash>FF) if we don''t find exactly the name we >> were looking for. >> >> An alternative solution to dealing with collisions could be to emulate the >> dir-item insertion code - specifically something like insert_with_overflow() >> which will stuff multiple items under one key. I tend to prefer the idea of >> simply including a generation in the key offset however since it maintains >> the 1:1 relationship of keys to names which turns out to be much nicer to >> code for in my honest opinion. Also none of the code which iterates the tree >> looking for refs would have to change as the only difference is in the key >> offset and not in the actual item structure. >> >> >> Testing wise, the patches are in an intermediate state. I''ve debugged a fair >> bit but I''m certain there''s gremlins lurking in there. The basic namespace >> operations work well enough (link, unlink, etc). I''ve done light testing of >> my changes in backref.c by exercising BTRFS_IOC_INO_PATHS. The changes in >> tree-log.c need the most review and testing - I haven''t really figured out a >> great way to exercise the code in tree-log yet (suggestions would be >> great!). >> > > For the log recover test, I used to sysrq+b to make sure our log remains on disk. > > Will also test this patchset sooner or later. >It Works fine in normal mode except we need to note people to modify their btrfs-progs with that incompat flag at the first step ;) However, for log recover, I use the following script: $ touch /mnt/btrfs/foobar; $ ./fsync_self /mnt/btrfs/foobar; (fsync_self is a wrapper of fsync() written by myself) $ for i in `seq 1 1 300`; do ln /mnt/btrfs/foobar /mnt/btrfs/foobar$i; ./fsync_self /mnt/btrfs/foobar$i; done; $ echo b > /proc/sysrq-trigger when we come back, $ mount disk /mnt/btrfs and it hits a warning and a hang, the dmesg log shows: Btrfs loaded device fsid 85811dec-dd03-44f1-a8e2-005a67c6b7f5 devid 1 transid 5 /dev/sdb7 btrfs: disk space caching is enabled Btrfs detected SSD devices, enabling SSD mode ------------[ cut here ]------------ WARNING: at fs/btrfs/ctree.c:1677 btrfs_search_slot+0x941/0x960 [btrfs]() Hardware name: QiTianM7150 Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ebtable_nat ebtables ipt_REJECT ip_tables bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_ helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs] Pid: 2323, comm: mount Tainted: G O 3.4.0-rc1 #8 Call Trace: [<ffffffff8104d59f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8104d5fa>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0715071>] btrfs_search_slot+0x941/0x960 [btrfs] [<ffffffffa07264de>] btrfs_lookup_dir_index_item+0x4e/0x90 [btrfs] [<ffffffffa076f139>] add_inode_ref+0x4b9/0x880 [btrfs] [<ffffffffa0771fc7>] replay_one_buffer+0x2a7/0x3b0 [btrfs] [<ffffffffa074700d>] ? btrfs_token_key_generation+0x5d/0xe0 [btrfs] [<ffffffffa076c31a>] walk_down_log_tree+0x23a/0x410 [btrfs] [<ffffffffa076c825>] walk_log_tree+0xb5/0x210 [btrfs] [<ffffffffa0770669>] btrfs_recover_log_trees+0x229/0x3e0 [btrfs] [<ffffffffa0771d20>] ? replay_one_dir_item+0xf0/0xf0 [btrfs] [<ffffffffa0730b08>] open_ctree+0x1598/0x1ae0 [btrfs] [<ffffffffa070bc94>] btrfs_mount+0x474/0x560 [btrfs] [<ffffffff811278be>] ? pcpu_next_pop+0x4e/0x70 [<ffffffff81169ab3>] mount_fs+0x43/0x1a0 [<ffffffff81129050>] ? __alloc_percpu+0x10/0x20 [<ffffffff8118406a>] vfs_kern_mount+0x6a/0xf0 [<ffffffff81184462>] do_kern_mount+0x52/0x110 [<ffffffff811f3f98>] ? security_capable+0x18/0x20 [<ffffffff81186355>] do_mount+0x255/0x7c0 [<ffffffff8112390b>] ? memdup_user+0x4b/0x90 [<ffffffff811239ab>] ? strndup_user+0x5b/0x80 [<ffffffff81186950>] sys_mount+0x90/0xe0 [<ffffffff814fab69>] system_call_fastpath+0x16/0x1b ---[ end trace d5fe92190ef227d6 ]--- SysRq : Show Blocked State task PC stack pid father mount D ffffffff81610340 0 2323 2254 0x00000080 ffff880075989408 0000000000000082 ffff880076c454a0 0000000000013440 ffff880075989fd8 ffff880075988010 0000000000013440 0000000000013440 ffff880075989fd8 0000000000013440 ffff88007a45cb30 ffff880076c454a0 Call Trace: [<ffffffff814f2029>] schedule+0x29/0x70 [<ffffffffa076af25>] btrfs_tree_lock+0xc5/0x2a0 [btrfs] [<ffffffff8106f850>] ? wake_up_bit+0x40/0x40 [<ffffffffa070e16b>] btrfs_lock_root_node+0x3b/0x50 [btrfs] [<ffffffffa0714e88>] btrfs_search_slot+0x758/0x960 [btrfs] [<ffffffffa0715b4d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs] [<ffffffffa07268a3>] insert_with_overflow+0x43/0x110 [btrfs] [<ffffffffa0726a4a>] btrfs_insert_dir_item+0xda/0x210 [btrfs] [<ffffffff8121f02b>] ? chksum_update+0x1b/0x30 [<ffffffffa0737f74>] btrfs_add_link+0xe4/0x2f0 [btrfs] [<ffffffffa0755bb4>] ? free_extent_buffer+0x34/0x80 [btrfs] [<ffffffffa076f22d>] add_inode_ref+0x5ad/0x880 [btrfs] [<ffffffffa0771fc7>] replay_one_buffer+0x2a7/0x3b0 [btrfs] [<ffffffffa074700d>] ? btrfs_token_key_generation+0x5d/0xe0 [btrfs] [<ffffffffa076c31a>] walk_down_log_tree+0x23a/0x410 [btrfs] [<ffffffffa076c825>] walk_log_tree+0xb5/0x210 [btrfs] [<ffffffffa0770669>] btrfs_recover_log_trees+0x229/0x3e0 [btrfs] [<ffffffffa0771d20>] ? replay_one_dir_item+0xf0/0xf0 [btrfs] [<ffffffffa0730b08>] open_ctree+0x1598/0x1ae0 [btrfs] [<ffffffffa070bc94>] btrfs_mount+0x474/0x560 [btrfs] [<ffffffff811278be>] ? pcpu_next_pop+0x4e/0x70 [<ffffffff81169ab3>] mount_fs+0x43/0x1a0 [<ffffffff81129050>] ? __alloc_percpu+0x10/0x20 [<ffffffff8118406a>] vfs_kern_mount+0x6a/0xf0 [<ffffffff81184462>] do_kern_mount+0x52/0x110 [<ffffffff811f3f98>] ? security_capable+0x18/0x20 [<ffffffff81186355>] do_mount+0x255/0x7c0 [<ffffffff8112390b>] ? memdup_user+0x4b/0x90 [<ffffffff811239ab>] ? strndup_user+0x5b/0x80 [<ffffffff81186950>] sys_mount+0x90/0xe0 [<ffffffff814fab69>] system_call_fastpath+0x16/0x1b btrfs-transacti D ffffffff81610340 0 2338 2 0x00000080 ffff880079df1ae0 0000000000000046 ffff8800372ca100 0000000000013440 ffff880079df1fd8 ffff880079df0010 0000000000013440 0000000000013440 ffff880079df1fd8 0000000000013440 ffffffff81a13020 ffff8800372ca100 Call Trace: [<ffffffff814f2029>] schedule+0x29/0x70 [<ffffffffa076af25>] btrfs_tree_lock+0xc5/0x2a0 [btrfs] [<ffffffff8106f850>] ? wake_up_bit+0x40/0x40 [<ffffffffa070e16b>] btrfs_lock_root_node+0x3b/0x50 [btrfs] [<ffffffffa0714e88>] btrfs_search_slot+0x758/0x960 [btrfs] [<ffffffffa072884f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffff814f08ce>] ? mutex_lock+0x1e/0x50 [<ffffffffa0784931>] btrfs_update_delayed_inode+0x71/0x140 [btrfs] [<ffffffffa0784e0a>] btrfs_run_delayed_items+0x12a/0x160 [btrfs] [<ffffffffa0732aef>] btrfs_commit_transaction+0x36f/0xa70 [btrfs] [<ffffffffa0733592>] ? start_transaction+0x92/0x320 [btrfs] [<ffffffff8106f850>] ? wake_up_bit+0x40/0x40 [<ffffffffa072e0fb>] transaction_kthread+0x26b/0x2e0 [btrfs] [<ffffffffa072de90>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffffa072de90>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffff8106f1ae>] kthread+0x9e/0xb0 [<ffffffff814fbe64>] kernel_thread_helper+0x4/0x10 [<ffffffff8106f110>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814fbe60>] ? gs_change+0x13/0x13> thanks, > liubo > >> Finally, these patches are based off Linux v3.3. >> --Mark >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jeff, On 05.04.2012 23:13, Jeff Mahoney wrote:>> As a result, we must use a different addressing scheme. Extended >> ref keys look like: > >> (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) > >> Where hash is defined as a function of the parent objectid and link >> name. > > I think this is effective. It will essentially have the same > properties as a dirent but seeds the hash at objectid instead of ~1.The objectid is already part of the key. What''s the point in seeding it to crc32 instead of ~1?>> Right now there is a limitation for extrefs in that we''re not >> handling the possibility of a hash collision. There are two ways I >> see we can deal with this: > >> We can use a 56-bit hash and keep a generation counter in the lower >> 8 bits of the offset field. The cost would be an additional tree >> search (between offset <hash>00 and <hash>FF) if we don''t find >> exactly the name we were looking for. > >> An alternative solution to dealing with collisions could be to >> emulate the dir-item insertion code - specifically something like >> insert_with_overflow() which will stuff multiple items under one >> key. I tend to prefer the idea of > > I vote for this option.Having more than one way to deal with hash collisions for keys doesn''t feel right. Let''s find out which works best and use it in both places. If the current insert_with_overflow style is best, please use that for extrefs as well. If it isn''t, can we replace it with something that works for both?> The code for insert_with_overflow is already > well testedI''m looking forward to seeing tests or references to the archives discussing this. Do you have a link for me?> and anything that will generate a collision in dirent > insertion will generate a collision for the backref insertion. The > dirent structure is larger than the extref structure, so there should > always be space to match the dirent leaf so that a failure occurs > there first.I agree with that bit. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11.04.2012 15:11, Jan Schmidt wrote:> Hi Jeff, > > On 05.04.2012 23:13, Jeff Mahoney wrote: >>> As a result, we must use a different addressing scheme. Extended >>> ref keys look like: >> >>> (inode objectid, BTRFS_INODE_EXTREF_KEY, hash) >> >>> Where hash is defined as a function of the parent objectid and link >>> name. >> >> I think this is effective. It will essentially have the same >> properties as a dirent but seeds the hash at objectid instead of ~1. > > The objectid is already part of the key. What''s the point in seeding it > to crc32 instead of ~1?While reading the patch set, I''m just seeing that the first objectid is the inode''s obectid while the latter is the parent''s objectid. I''m fine with that. Thanks, -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05.04.2012 22:09, Mark Fasheh wrote:> This patch adds basic support for extended inode refs. This includes support > for link and unlink of the refs, which basically gets us support for rename > as well. > > Inode creation does not need changing - extended refs are only added after > the ref array is full. > > Signed-off-by: Mark Fasheh <mfasheh@suse.de> > --- > fs/btrfs/ctree.h | 50 +++++++++-- > fs/btrfs/inode-item.c | 244 +++++++++++++++++++++++++++++++++++++++++++++++-- > fs/btrfs/inode.c | 20 ++-- > 3 files changed, 288 insertions(+), 26 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 80b6486..5fc77ee 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -143,6 +143,13 @@ struct btrfs_ordered_sum; > */ > #define BTRFS_NAME_LEN 255 > > +/* > + * Theoretical limit is larger, but we keep this down to a sane > + * value. That should limit greatly the possibility of collisions on > + * inode ref items. > + */ > +#define BTRFS_LINK_MAX 65535UDo we really want an artificial limit like that?> + > /* 32 bytes in various csum fields */ > #define BTRFS_CSUM_SIZE 32 > > @@ -462,13 +469,16 @@ struct btrfs_super_block { > #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (1ULL << 2) > #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO (1ULL << 3) > > +#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) > + > #define BTRFS_FEATURE_COMPAT_SUPP 0ULL > #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL > #define BTRFS_FEATURE_INCOMPAT_SUPP \ > (BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \ > BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL | \ > BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ > - BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO) > + BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \ > + BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) > > /* > * A leaf is full of items. offset and size tell us where to find > @@ -615,6 +625,14 @@ struct btrfs_inode_ref { > /* name goes here */ > } __attribute__ ((__packed__)); > > +struct btrfs_inode_extref { > + __le64 parent_objectid; > + __le64 index; > + __le16 name_len; > + __u8 name[0]; > + /* name goes here */ > +} __attribute__ ((__packed__)); > + > struct btrfs_timespec { > __le64 sec; > __le32 nsec; > @@ -1400,6 +1418,7 @@ struct btrfs_ioctl_defrag_range_args { > */ > #define BTRFS_INODE_ITEM_KEY 1 > #define BTRFS_INODE_REF_KEY 12 > +#define BTRFS_INODE_EXTREF_KEY 13 > #define BTRFS_XATTR_ITEM_KEY 24 > #define BTRFS_ORPHAN_ITEM_KEY 48 > /* reserve 2-15 close to the inode for later flexibility */ > @@ -1701,6 +1720,13 @@ BTRFS_SETGET_STACK_FUNCS(block_group_flags, > BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16); > BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64); > > +/* struct btrfs_inode_extref */ > +BTRFS_SETGET_FUNCS(inode_extref_parent, struct btrfs_inode_extref, > + parent_objectid, 64); > +BTRFS_SETGET_FUNCS(inode_extref_name_len, struct btrfs_inode_extref, > + name_len, 16); > +BTRFS_SETGET_FUNCS(inode_extref_index, struct btrfs_inode_extref, index, 64); > + > /* struct btrfs_inode_item */ > BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 64); > BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64); > @@ -2791,12 +2817,12 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > struct btrfs_root *root, > const char *name, int name_len, > u64 inode_objectid, u64 ref_objectid, u64 *index); > -struct btrfs_inode_ref * > -btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, > - struct btrfs_root *root, > - struct btrfs_path *path, > - const char *name, int name_len, > - u64 inode_objectid, u64 ref_objectid, int mod); > +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + struct btrfs_path *path, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, int mod, > + u64 *ret_index); > int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans, > struct btrfs_root *root, > struct btrfs_path *path, u64 objectid); > @@ -2804,6 +2830,16 @@ int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root > *root, struct btrfs_path *path, > struct btrfs_key *location, int mod); > > +struct btrfs_inode_extref * > +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + struct btrfs_path *path, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, int ins_len, > + int cow); > + > +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len); > + > /* file-item.c */ > int btrfs_del_csums(struct btrfs_trans_handle *trans, > struct btrfs_root *root, u64 bytenr, u64 len); > diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c > index baa74f3..cc5e854 100644 > --- a/fs/btrfs/inode-item.c > +++ b/fs/btrfs/inode-item.c > @@ -16,6 +16,7 @@ > * Boston, MA 021110-1307, USA. > */ > > +#include <linux/crc32c.h> > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > @@ -49,18 +50,56 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name, > return 0; > } > > -struct btrfs_inode_ref * > +static int find_name_in_ext_backref(struct btrfs_path *path, const char *name, > + int name_len, struct btrfs_inode_extref **ref_ret) > +{ > + struct extent_buffer *leaf; > + struct btrfs_inode_extref *ref; > + unsigned long ptr; > + unsigned long name_ptr; > + u32 item_size; > + > + leaf = path->nodes[0]; > + item_size = btrfs_item_size_nr(leaf, path->slots[0]); > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > + ref = (struct btrfs_inode_extref *)ptr; > + > + /* > + * There isn''t actually anything to "search" in an extended > + * backref - one name is directly attached to each > + * item. Instead what we''re really doing here is some sort of > + * sanity check - does the name exist where it should have > + * been found. If all is well, we''ll return success and the > + * inode ref object. > + */In that case, the name is not a good choice. However, if we change how we deal with hash collisions, the name might be right and the comment might go away as we really look through more than one item. If we leave it like this, please rename that function.> + name_ptr = (unsigned long)(&ref->name); > + if (btrfs_inode_extref_name_len(leaf, ref) == name_len > + && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) { > + *ref_ret = ref; > + return 1; > + } > + return 0; > +} > + > +/* > + * Figure the key offset of an extended inode ref > + */ > +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len)I''d suggest to call this one btrfs_extref_hash and move it to fs/btrfs/hash.h. That way, we can also drop the include for <linux/crc32c.h> above.> +{ > + return (u64) crc32c(parent_objectid, name, name_len); > +} > + > +static struct btrfs_inode_ref * > btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, > - struct btrfs_root *root, > - struct btrfs_path *path, > - const char *name, int name_len, > - u64 inode_objectid, u64 ref_objectid, int mod) > + struct btrfs_root *root, > + struct btrfs_path *path, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, int ins_len, > + int cow) > { > + int ret; > struct btrfs_key key; > struct btrfs_inode_ref *ref; > - int ins_len = mod < 0 ? -1 : 0; > - int cow = mod != 0; > - int ret; > > key.objectid = inode_objectid; > key.type = BTRFS_INODE_REF_KEY; > @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, > return ref; > } > > -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > +struct btrfs_inode_extref * > +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + struct btrfs_path *path, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, int ins_len, > + int cow) > +{ > + int ret; > + struct btrfs_key key; > + struct btrfs_inode_extref *ref;^^^ Please use some other identifier than the one we use for struct btrfs_inode_ref. Suggestion: extref or eref.> + > + key.objectid = inode_objectid; > + key.type = BTRFS_INODE_EXTREF_KEY; > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);^^^^^^^^^^^^^^^^^^^^ Get''s clearer when that is called btrfs_extref_hash.> + > + ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow); > + if (ret < 0) > + return ERR_PTR(ret); > + if (ret > 0) > + return NULL; > + if (!find_name_in_ext_backref(path, name, name_len, &ref)) > + return NULL; > + return ref; > +} > + > +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + struct btrfs_path *path, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, int mod, > + u64 *ret_index) > +{ > + struct btrfs_inode_ref *ref1; > + struct btrfs_inode_extref *ref2;^^^^ Please, use "ref" and ("eref" or "extref").> + int ins_len = mod < 0 ? -1 : 0; > + int cow = mod != 0; > + > + ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len, > + inode_objectid, ref_objectid, ins_len, > + cow); > + if (IS_ERR(ref1)) > + return PTR_ERR(ref1); > + > + if (ref1 != NULL) { > + *ret_index = btrfs_inode_ref_index(path->nodes[0], ref1); > + return 0; > + } > + > + btrfs_release_path(path); > + > + ref2 = btrfs_lookup_inode_extref(trans, root, path, name, > + name_len, inode_objectid, > + ref_objectid, ins_len, cow); > + if (IS_ERR(ref2)) > + return PTR_ERR(ref2); > + > + if (ref2) { > + *ret_index = btrfs_inode_extref_index(path->nodes[0], ref2); > + return 0; > + } > + > + return -ENOENT; > +} > + > +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans, > struct btrfs_root *root, > const char *name, int name_len, > u64 inode_objectid, u64 ref_objectid, u64 *index) > { > struct btrfs_path *path; > struct btrfs_key key; > + struct btrfs_inode_extref *ref2;^^^^> + struct extent_buffer *leaf; > + int ret; > + > + key.objectid = inode_objectid; > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); > + > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + path->leave_spinning = 1; > + > + ret = btrfs_search_slot(trans, root, &key, path, -1, 1); > + if (ret > 0) { > + ret = -ENOENT; > + goto out; > + } else if (ret < 0) { > + goto out; > + }Clearer (to me): if (ret > 0) ret = -ENOENT; if (ret < 0) goto out;> + > + /* > + * Sanity check - did we find the right item for this name? > + * This should always succeed so error here will make the FS > + * readonly. > + */ > + if (!find_name_in_ext_backref(path, name, name_len, &ref2)) { > + btrfs_std_error(root->fs_info, -ENOENT); > + ret = -EROFS;Simply returning -EROFS doesn''t make the fs read-only, does it? I''d suggest to return -EIO and drop the comment above.> + goto out; > + } > + > + leaf = path->nodes[0]; > + if (index) > + *index = btrfs_inode_extref_index(leaf, ref2); > + > + ret = btrfs_del_item(trans, root, path); > + > +out: > + btrfs_free_path(path); > + > + return ret; > +} > + > +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, u64 *index) > +{ > + struct btrfs_path *path; > + struct btrfs_key key; > struct btrfs_inode_ref *ref; > struct extent_buffer *leaf; > unsigned long ptr; > unsigned long item_start; > u32 item_size; > u32 sub_item_len; > - int ret; > + int ret, search_ext_refs = 0;^^^^^^^^^^^^^^^^^^^^^ Documentation/CodingStyle: -- To this end, use just one data declaration per line (no commas for multiple data declarations). -- You do this multiple times in your whole patch set.> int del_len = name_len + sizeof(*ref); > > key.objectid = inode_objectid; > @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > ret = btrfs_search_slot(trans, root, &key, path, -1, 1); > if (ret > 0) { > ret = -ENOENT; > + search_ext_refs = 1; > goto out; > } else if (ret < 0) { > goto out; > } > if (!find_name_in_backref(path, name, name_len, &ref)) { > ret = -ENOENT; > + search_ext_refs = 1; > goto out; > } > leaf = path->nodes[0]; > @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > item_size - sub_item_len, 1); > out: > btrfs_free_path(path); > + > + if (search_ext_refs) {As far as I see it, you can drop search_ext_refs and simply go this way on ret == -ENOENT.> + /*^ Catched by accident: trailing whitespace.> + * No refs were found, or we could not find the > + * name in our ref array. Find and remove the extended > + * inode ref then. > + */ > + return btrfs_del_inode_extref(trans, root, name, name_len, > + inode_objectid, ref_objectid, index); > + } > + > + return ret; > +} > + > +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + const char *name, int name_len, > + u64 inode_objectid, u64 ref_objectid, u64 index)Are we missing a check against BTRFS_LINK_MAX in this function?> +{ > + struct btrfs_path *path; > + struct btrfs_key key; > + struct btrfs_inode_extref *ref;extref> + unsigned long ptr; > + int ret; > + int ins_len = name_len + sizeof(*ref); > + > + key.objectid = inode_objectid; > + key.type = BTRFS_INODE_EXTREF_KEY; > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); > + > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + path->leave_spinning = 1; > + ret = btrfs_insert_empty_item(trans, root, path, &key, > + ins_len); > + if (ret < 0) > + goto out; > + > + ref = btrfs_item_ptr(path->nodes[0], path->slots[0], > + struct btrfs_inode_extref); > + > + btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len); > + btrfs_set_inode_extref_index(path->nodes[0], ref, index); > + btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid); > + > + ptr = (unsigned long)&ref->name; > + write_extent_buffer(path->nodes[0], name, ptr, name_len); > + btrfs_mark_buffer_dirty(path->nodes[0]); > + > +out: > + btrfs_free_path(path); > return ret; > } > > @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans, > > out: > btrfs_free_path(path); > + > + if (ret == -EMLINK) { > + struct btrfs_super_block *disk_super = root->fs_info->super_copy; > + /* We ran out of space in the ref array. Need to > + * add an extended ref. */ > + if (btrfs_super_incompat_flags(disk_super) > + & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) > + ret = btrfs_insert_inode_extref(trans, root, name, > + name_len, > + inode_objectid, > + ref_objectid, index); > + } > + > return ret; > } > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 892b347..0ca525d 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, > struct btrfs_trans_handle *trans; > struct btrfs_root *root = BTRFS_I(dir)->root; > struct btrfs_path *path; > - struct btrfs_inode_ref *ref; > struct btrfs_dir_item *di; > struct inode *inode = dentry->d_inode; > u64 index; > @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, > } > btrfs_release_path(path); > > - ref = btrfs_lookup_inode_ref(trans, root, path, > - dentry->d_name.name, dentry->d_name.len, > - ino, dir_ino, 0); > - if (IS_ERR(ref)) { > - err = PTR_ERR(ref); > + ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name, > + dentry->d_name.len, ino, dir_ino, 0, &index);This line is >80 chars. Please use scripts/checkpatch.pl to catch those.> + if (ret) { > + err = ret; > goto out; > } > - BUG_ON(!ref); > + > if (check_path_shared(root, path)) > goto out; > - index = btrfs_inode_ref_index(path->nodes[0], ref); > + > btrfs_release_path(path); > > /* > @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, > btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY); > key[0].offset = 0; > > + /* > + * Start new inodes with a V1 ref. This is slightly more > + * efficient for small numbers of hard links. > + */I''d drop that comment. extref is no "V2" kind of ref. But you can leave it in if you feel it helps anyone later.> key[1].objectid = objectid; > btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY); > key[1].offset = ref_objectid; > @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, > if (root->objectid != BTRFS_I(inode)->root->objectid) > return -EXDEV; > > - if (inode->i_nlink == ~0U) > + if (inode->i_nlink >= BTRFS_LINK_MAX) > return -EMLINK; > > err = btrfs_set_inode_index(dir, &index);Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05.04.2012 22:09, Mark Fasheh wrote:> Teach tree-log.c about extended inode refs. In particular, we have to adjust > the behavior of inode ref replay as well as log tree recovery to account for > the existence of extended refs. > > Signed-off-by: Mark Fasheh <mfasheh@suse.de> > --- > fs/btrfs/tree-log.c | 320 +++++++++++++++++++++++++++++++++++++++++--------- > fs/btrfs/tree-log.h | 4 + > 2 files changed, 266 insertions(+), 58 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index 966cc74..d69b07a 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -23,6 +23,7 @@ > #include "disk-io.h" > #include "locking.h" > #include "print-tree.h" > +#include "backref.h"So now tree-log.c includes backref.h and backref.c includes tree-log.h, this is not a problem by itself, but I''d try to avoid such dependencies. I''d put find_one_extref over to backref.c to solve this, also because there it would be closer to inode_ref_info (which does the same for INODE_REFs.> #include "compat.h" > #include "tree-log.h" > > @@ -748,6 +749,7 @@ static noinline int backref_in_log(struct btrfs_root *log, > { > struct btrfs_path *path; > struct btrfs_inode_ref *ref; > + struct btrfs_inode_extref *extref;^^^^^^ :-)> unsigned long ptr; > unsigned long ptr_end; > unsigned long name_ptr; > @@ -764,8 +766,24 @@ static noinline int backref_in_log(struct btrfs_root *log, > if (ret != 0) > goto out; > > - item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]); > ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]); > + > + if (key->type == BTRFS_INODE_EXTREF_KEY) { > + extref = (struct btrfs_inode_extref *)ptr; > + > + found_name_len = btrfs_inode_extref_name_len(path->nodes[0], > + extref); > + if (found_name_len == namelen) { > + name_ptr = (unsigned long)&extref->name; > + ret = memcmp_extent_buffer(path->nodes[0], name, > + name_ptr, namelen); > + if (ret == 0) > + match = 1; > + } > + goto out; > + } > + > + item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]); > ptr_end = ptr + item_size; > while (ptr < ptr_end) { > ref = (struct btrfs_inode_ref *)ptr; > @@ -786,7 +804,6 @@ out: > return match; > } > > - > /* > * replay one inode back reference item found in the log tree. > * eb, slot and key refer to the buffer and key found in the log tree. > @@ -801,15 +818,20 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, > struct btrfs_key *key) > { > struct btrfs_inode_ref *ref; > + struct btrfs_inode_extref *extref; > struct btrfs_dir_item *di; > + struct btrfs_key search_key;You don''t need search_key, just use key from the parameter list as the code did previously.> struct inode *dir; > struct inode *inode; > unsigned long ref_ptr; > unsigned long ref_end; > - char *name; > - int namelen; > + char *name, *victim_name; > + int namelen, victim_name_len;split> int ret; > int search_done = 0; > + int log_ref_ver = 0; > + u64 parent_objectid, inode_objectid, ref_index;split> + struct extent_buffer *leaf; > > /* > * it is possible that we didn''t log all the parent directories > @@ -817,32 +839,56 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, > * copy the back ref in. The link count fixup code will take > * care of the rest > */ > - dir = read_one_inode(root, key->offset); > + > + if (key->type == BTRFS_INODE_EXTREF_KEY) { > + log_ref_ver = 1;^^^^^^^^^^^^^^^ Assigned but never used.> + > + ref_ptr = btrfs_item_ptr_offset(eb, slot); > + > + /* So that we don''t loop back looking for old style log refs. */ > + ref_end = ref_ptr; > + > + extref = (struct btrfs_inode_extref *) btrfs_item_ptr_offset(eb, slot); > + namelen = btrfs_inode_extref_name_len(eb, extref); > + name = kmalloc(namelen, GFP_NOFS);kmalloc may fail.> + > + read_extent_buffer(eb, name, (unsigned long)&extref->name, > + namelen); > + > + ref_index = btrfs_inode_extref_index(eb, extref); > + parent_objectid = btrfs_inode_extref_parent(eb, extref); > + } else { > + parent_objectid = key->offset; > + > + ref_ptr = btrfs_item_ptr_offset(eb, slot); > + ref_end = ref_ptr + btrfs_item_size_nr(eb, slot); > + > + ref = (struct btrfs_inode_ref *)ref_ptr; > + namelen = btrfs_inode_ref_name_len(eb, ref); > + name = kmalloc(namelen, GFP_NOFS); > + BUG_ON(!name); > + > + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); > + > + ref_index = btrfs_inode_ref_index(eb, ref);The way you put it, you had to copy all the code from the else block down before the "goto again" case. Code duplication is error prone. Maybe you can manage to put everything between "again:" and "goto again" into a separate function that can be called multiple times. For the old INODE_REF items that could be done in a loop then. This would also split an overly long function into more readable units.> + } > + > + inode_objectid = key->objectid; > + > + dir = read_one_inode(root, parent_objectid); > if (!dir) > return -ENOENT; > > - inode = read_one_inode(root, key->objectid); > + inode = read_one_inode(root, inode_objectid); > if (!inode) { > iput(dir); > return -EIO; > } > > - ref_ptr = btrfs_item_ptr_offset(eb, slot); > - ref_end = ref_ptr + btrfs_item_size_nr(eb, slot); > - > again: > - ref = (struct btrfs_inode_ref *)ref_ptr; > - > - namelen = btrfs_inode_ref_name_len(eb, ref); > - name = kmalloc(namelen, GFP_NOFS); > - BUG_ON(!name); > - > - read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); > - > /* if we already have a perfect match, we''re done */ > if (inode_in_dir(root, path, btrfs_ino(dir), btrfs_ino(inode), > - btrfs_inode_ref_index(eb, ref), > - name, namelen)) { > + ref_index, name, namelen)) { > goto out; > } > > @@ -857,19 +903,23 @@ again: > if (search_done) > goto insert; > > - ret = btrfs_search_slot(NULL, root, key, path, 0, 0); > + /* Search old style refs */ > + search_key.objectid = inode_objectid; > + search_key.type = BTRFS_INODE_REF_KEY; > + search_key.offset = parent_objectid; > + > + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); > if (ret == 0) { > - char *victim_name; > - int victim_name_len; > struct btrfs_inode_ref *victim_ref; > unsigned long ptr; > unsigned long ptr_end; > - struct extent_buffer *leaf = path->nodes[0]; > + > + leaf = path->nodes[0]; > > /* are we trying to overwrite a back ref for the root directory > * if so, just jump out, we''re done > */ > - if (key->objectid == key->offset) > + if (search_key.objectid == search_key.offset) > goto out_nowrite; > > /* check all the names in this back reference to see > @@ -889,7 +939,7 @@ again: > (unsigned long)(victim_ref + 1), > victim_name_len); > > - if (!backref_in_log(log, key, victim_name, > + if (!backref_in_log(log, &search_key, victim_name, > victim_name_len)) { > btrfs_inc_nlink(inode); > btrfs_release_path(path); > @@ -902,19 +952,44 @@ again: > ptr = (unsigned long)(victim_ref + 1) + victim_name_len; > } > BUG_ON(ret); > - > - /* > - * NOTE: we have searched root tree and checked the > - * coresponding ref, it does not need to check again. > - */ > - search_done = 1; > } > btrfs_release_path(path); > > + /* Same search but for extended refs */ > + extref = btrfs_lookup_inode_extref(NULL, root, path, name, namelen, > + inode_objectid, parent_objectid, 0, > + 0); > + if (extref && !IS_ERR(extref)) {if (!IS_ERR_OR_NULL(extref))> + victim_name_len = btrfs_inode_extref_name_len(eb, extref); > + victim_name = kmalloc(namelen, GFP_NOFS); > + leaf = path->nodes[0]; > + read_extent_buffer(eb, name, (unsigned long)&extref->name, namelen); > + > + search_key.objectid = inode_objectid; > + search_key.type = BTRFS_INODE_EXTREF_KEY; > + search_key.offset = btrfs_extref_key_off(parent_objectid, name, namelen); > + if (!backref_in_log(log, &search_key, victim_name, > + victim_name_len)) { > + btrfs_inc_nlink(inode); > + btrfs_release_path(path); > + > + ret = btrfs_unlink_inode(trans, root, dir, > + inode, victim_name, > + victim_name_len); > + } > + kfree(victim_name); > + BUG_ON(ret); > + } > + > + /* > + * NOTE: we have searched root tree and checked the > + * coresponding refs, it does not need to be checked again. > + */ > + search_done = 1;You moved this comment and assignment out of the "if (ret == 0)" case. I''m not sure if this is still doing exactly the same now (?). Previously, we were executing another btrfs_search_slot, btrfs_lookup_dir_index_item, ... after the "goto again" case, which would be skipped with this patch.> + > /* look for a conflicting sequence number */ > di = btrfs_lookup_dir_index_item(trans, root, path, btrfs_ino(dir), > - btrfs_inode_ref_index(eb, ref), > - name, namelen, 0); > + ref_index, name, namelen, 0); > if (di && !IS_ERR(di)) { > ret = drop_one_dir_item(trans, root, path, dir, di); > BUG_ON(ret); > @@ -932,17 +1007,25 @@ again: > > insert: > /* insert our name */ > - ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, > - btrfs_inode_ref_index(eb, ref)); > + ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, ref_index); > BUG_ON(ret); > > btrfs_update_inode(trans, root, inode); > > out: > - ref_ptr = (unsigned long)(ref + 1) + namelen; > + ref_ptr = (unsigned long)(ref_ptr + sizeof(struct btrfs_inode_ref)) + namelen; > kfree(name); > - if (ref_ptr < ref_end) > + if (ref_ptr < ref_end) { > + ref = (struct btrfs_inode_ref *)ref_ptr; > + namelen = btrfs_inode_ref_name_len(eb, ref); > + name = kmalloc(namelen, GFP_NOFS); > + BUG_ON(!name); > + > + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); > + > + ref_index = btrfs_inode_ref_index(eb, ref);This is the duplicated code mentioned above.> goto again; > + } > > /* finally write the back reference in the inode */ > ret = overwrite_item(trans, root, path, eb, slot, key); > @@ -965,25 +1048,103 @@ static int insert_orphan_item(struct btrfs_trans_handle *trans, > return ret; > } > > +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, > + u64 start_off, struct btrfs_path *path, > + struct btrfs_inode_extref **ret_ref, u64 *found_off)ret_extref> +{ > + int ret, slot; > + struct btrfs_key key, found_key;split> + struct btrfs_inode_extref *ref; > + struct extent_buffer *leaf; > + struct btrfs_item *item; > + unsigned long ptr; > > -/* > - * There are a few corners where the link count of the file can''t > - * be properly maintained during replay. So, instead of adding > - * lots of complexity to the log code, we just scan the backrefs > - * for any file that has been through replay. > - * > - * The scan will update the link count on the inode to reflect the > - * number of back refs found. If it goes down to zero, the iput > - * will free the inode. > - */ > -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > - struct btrfs_root *root, > - struct inode *inode)[side note: this is really strange patch alignment. my git is doing it the same way.]> + key.objectid = inode_objectid; > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > + key.offset = start_off; > + > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > + if (ret < 0) > + goto out;Just return here and drop the out label.> + > + while (1) { > + leaf = path->nodes[0]; > + slot = path->slots[0]; > + if (slot >= btrfs_header_nritems(leaf)) { > + /* > + * If the item at offset is not found, > + * btrfs_search_slot will point us to the slot > + * where it should be inserted. In our case > + * that will be the slot directly before the > + * next INODE_REF_KEY_V2 item. In the case > + * that we''re pointing to the last slot in a > + * leaf, we must move one leaf over. > + */ > + ret = btrfs_next_leaf(root, path); > + if (ret) { > + if (ret >= 1) > + ret = -ENOENT; > + break; > + }This large block (and its comment) can be replaced by btrfs_search_slot_for_read with find_higher = 1, return_any = 0. This also avoids the "continue" and we even git rid of the whole while (1) loop here.> + continue; > + } > + > + item = btrfs_item_nr(leaf, slot); > + btrfs_item_key_to_cpu(leaf, &found_key, slot); > + > + /* > + * Check that we''re still looking at an extended ref key for > + * this particular objectid. If we have different > + * objectid or type then there are no more to be found > + * in the tree and we can exit. > + */ > + ret = -ENOENT; > + if (found_key.objectid != inode_objectid) > + break; > + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY) > + break; > + > + ret = 0; > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > + ref = (struct btrfs_inode_extref *)ptr; > + *ret_ref = ref; > + if (found_off) > + *found_off = found_key.offset + 1; > + break; > + } > + > +out: > + return ret; > +} > + > +static int count_inode_extrefs(struct btrfs_root *root, > + struct inode *inode, struct btrfs_path *path) > +{ > + int ret; > + unsigned int nlink = 0; > + u64 inode_objectid = btrfs_ino(inode); > + u64 offset = 0; > + struct btrfs_inode_extref *ref; > + > + while (1) { > + ret = btrfs_find_one_extref(root, inode_objectid, offset, path, > + &ref, &offset); > + if (ret) > + break;Assume the first call to btrfs_find_ione_extref returns -EIO. Do we really want count_inode_extrefs return 0 here? I agree that the previous code suffers from the same problem, but still: it''s a problem.> + > + nlink++; > + offset++; > + } > + > + return nlink; > +} > + > +static int count_inode_refs(struct btrfs_root *root, > + struct inode *inode, struct btrfs_path *path) > { > - struct btrfs_path *path; > int ret; > struct btrfs_key key; > - u64 nlink = 0; > + unsigned int nlink = 0; > unsigned long ptr; > unsigned long ptr_end; > int name_len; > @@ -993,10 +1154,6 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > key.type = BTRFS_INODE_REF_KEY; > key.offset = (u64)-1; > > - path = btrfs_alloc_path(); > - if (!path) > - return -ENOMEM; > - > while (1) { > ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > if (ret < 0) > @@ -1030,6 +1187,48 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > btrfs_release_path(path); > } > btrfs_release_path(path); > + btrfs_free_path(path);In general, you must not free a path when you don''t allocate it.> + > + return nlink; > +} > + > +/* > + * There are a few corners where the link count of the file can''t > + * be properly maintained during replay. So, instead of adding > + * lots of complexity to the log code, we just scan the backrefs > + * for any file that has been through replay. > + * > + * The scan will update the link count on the inode to reflect the > + * number of back refs found. If it goes down to zero, the iput > + * will free the inode. > + */ > +static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > + struct btrfs_root *root, > + struct inode *inode) > +{ > + struct btrfs_path *path; > + int ret; > + u64 nlink = 0; > + u64 ino = btrfs_ino(inode); > + > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + ret = count_inode_refs(root, inode, path); > + btrfs_release_path(path);Either count_inode_refs should alloc it''s private path, or it should return the path in a released state. The caller should not be responsible to pass a clean path and release it afterwards, as count_inode_refs does not return anything through the path.> + if (ret < 0) > + goto out; > + > + nlink = ret; > + > + ret = count_inode_extrefs(root, inode, path); > + btrfs_release_path(path);Same here. I''d just put the btrfs_release_path into count_inode_(ext)refs.> + if (ret < 0) > + goto out; > + > + nlink += ret; > + > if (nlink != inode->i_nlink) { > set_nlink(inode, nlink); > btrfs_update_inode(trans, root, inode); > @@ -1045,9 +1244,10 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > ret = insert_orphan_item(trans, root, ino); > BUG_ON(ret); > } > - btrfs_free_path(path); > > - return 0; > +out: > + btrfs_free_path(path); > + return ret; > } > > static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans, > @@ -1689,6 +1889,10 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb, > ret = add_inode_ref(wc->trans, root, log, path, > eb, i, &key); > BUG_ON(ret && ret != -ENOENT); > + } else if (key.type == BTRFS_INODE_EXTREF_KEY) { > + ret = add_inode_ref(wc->trans, root, log, path, > + eb, i, &key); > + BUG_ON(ret && ret != -ENOENT); > } else if (key.type == BTRFS_EXTENT_DATA_KEY) { > ret = replay_one_extent(wc->trans, root, path, > eb, i, &key); > diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h > index 2270ac5..fd40ad5 100644 > --- a/fs/btrfs/tree-log.h > +++ b/fs/btrfs/tree-log.h > @@ -49,4 +49,8 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans, > int btrfs_log_new_name(struct btrfs_trans_handle *trans, > struct inode *inode, struct inode *old_dir, > struct dentry *parent); > +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, > + u64 start_off, struct btrfs_path *path, > + struct btrfs_inode_extref **ret_ref, u64 *found_off); > + > #endif-Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Mark, While reading 3/3 I stumbled across one more thing in this one: On 05.04.2012 22:09, Mark Fasheh wrote:> +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, > + u64 start_off, struct btrfs_path *path, > + struct btrfs_inode_extref **ret_ref, u64 *found_off) > +{ > + int ret, slot; > + struct btrfs_key key, found_key; > + struct btrfs_inode_extref *ref; > + struct extent_buffer *leaf; > + struct btrfs_item *item; > + unsigned long ptr; > > -/* > - * There are a few corners where the link count of the file can''t > - * be properly maintained during replay. So, instead of adding > - * lots of complexity to the log code, we just scan the backrefs > - * for any file that has been through replay. > - * > - * The scan will update the link count on the inode to reflect the > - * number of back refs found. If it goes down to zero, the iput > - * will free the inode. > - */ > -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > - struct btrfs_root *root, > - struct inode *inode) > + key.objectid = inode_objectid; > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > + key.offset = start_off; > + > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > + if (ret < 0) > + goto out; > + > + while (1) { > + leaf = path->nodes[0]; > + slot = path->slots[0]; > + if (slot >= btrfs_header_nritems(leaf)) { > + /* > + * If the item at offset is not found, > + * btrfs_search_slot will point us to the slot > + * where it should be inserted. In our case > + * that will be the slot directly before the > + * next INODE_REF_KEY_V2 item. In the case > + * that we''re pointing to the last slot in a > + * leaf, we must move one leaf over. > + */ > + ret = btrfs_next_leaf(root, path); > + if (ret) { > + if (ret >= 1) > + ret = -ENOENT; > + break; > + } > + continue; > + } > + > + item = btrfs_item_nr(leaf, slot); > + btrfs_item_key_to_cpu(leaf, &found_key, slot); > + > + /* > + * Check that we''re still looking at an extended ref key for > + * this particular objectid. If we have different > + * objectid or type then there are no more to be found > + * in the tree and we can exit. > + */ > + ret = -ENOENT; > + if (found_key.objectid != inode_objectid) > + break; > + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY) > + break; > + > + ret = 0; > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > + ref = (struct btrfs_inode_extref *)ptr; > + *ret_ref = ref; > + if (found_off) > + *found_off = found_key.offset + 1;^^^ It''s evil to call it "found offset" an then return one larger than the offset found. No caller would ever expect this.> + break; > + } > + > +out: > + return ret; > +} > + > +static int count_inode_extrefs(struct btrfs_root *root, > + struct inode *inode, struct btrfs_path *path) > +{ > + int ret; > + unsigned int nlink = 0; > + u64 inode_objectid = btrfs_ino(inode); > + u64 offset = 0; > + struct btrfs_inode_extref *ref; > + > + while (1) { > + ret = btrfs_find_one_extref(root, inode_objectid, offset, path, > + &ref, &offset); > + if (ret) > + break; > + > + nlink++; > + offset++;^^^^^^^^ Huh. See? The caller expected to get the offset found from btrfs_find_one_extref. As it stands you might be missing the very next key.> + } > + > + return nlink; > +}-Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 11, 2012 at 03:11:46PM +0200, Jan Schmidt wrote:> Hi Jeff, > > > >> An alternative solution to dealing with collisions could be to > >> emulate the dir-item insertion code - specifically something like > >> insert_with_overflow() which will stuff multiple items under one > >> key. I tend to prefer the idea of > > > > I vote for this option.[ Big patch series, thanks Mark! ] I prefer the insert_with_overflow because it makes the deletion case less complex. If we handle collisions with bits in the offset, we have to search around in the tree to find the key that was actually used to insert the item. The insert_with_overflow code uses just one key, at the cost of having to search around inside the item. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 12:11:13PM -0400, Chris Mason wrote:> On Wed, Apr 11, 2012 at 03:11:46PM +0200, Jan Schmidt wrote: > > Hi Jeff, > > > > > >> An alternative solution to dealing with collisions could be to > > >> emulate the dir-item insertion code - specifically something like > > >> insert_with_overflow() which will stuff multiple items under one > > >> key. I tend to prefer the idea of > > > > > > I vote for this option. > > [ Big patch series, thanks Mark! ] > > I prefer the insert_with_overflow because it makes the deletion case > less complex. If we handle collisions with bits in the offset, we have > to search around in the tree to find the key that was actually used to > insert the item. > > The insert_with_overflow code uses just one key, at the cost of having > to search around inside the item.Yeah this actually turns out to be a bit easier to code as well. I''m taking this approach. --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05.04.2012 22:09, Mark Fasheh wrote:> The iterate_irefs in backref.c is used to build path components from inode > refs. I had to add a 2nd iterate function callback to handle extended refs. > > Both iterate callbacks eventually converge upon iref_to_path() which I was > able to keep as one function with some small code to abstract away > differences in the two disk structures. > > Signed-off-by: Mark Fasheh <mfasheh@suse.de> > --- > fs/btrfs/backref.c | 200 ++++++++++++++++++++++++++++++++++++++++++---------- > fs/btrfs/backref.h | 4 +- > 2 files changed, 165 insertions(+), 39 deletions(-) > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c > index 0436c12..f2b8952 100644 > --- a/fs/btrfs/backref.c > +++ b/fs/btrfs/backref.c > @@ -22,6 +22,7 @@ > #include "ulist.h" > #include "transaction.h" > #include "delayed-ref.h" > +#include "tree-log.h"I mentioned that in the 2/3 review mail. I''d rather get rid of this #include here.> > /* > * this structure records all encountered refs on the way up to the root > @@ -858,62 +859,75 @@ static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root, > } > > /* > - * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements > - * of the path are separated by ''/'' and the path is guaranteed to be > - * 0-terminated. the path is only given within the current file system. > - * Therefore, it never starts with a ''/''. the caller is responsible to provide > - * "size" bytes in "dest". the dest buffer will be filled backwards. finally, > - * the start point of the resulting string is returned. this pointer is within > - * dest, normally. > - * in case the path buffer would overflow, the pointer is decremented further > - * as if output was written to the buffer, though no more output is actually > - * generated. that way, the caller can determine how much space would be > - * required for the path to fit into the buffer. in that case, the returned > - * value will be smaller than dest. callers must check this! > + * Given the parent objectid and name/name_len pairs of an inode ref > + * (any version) this iterates to turn that information into a > + * full filesystem path. elements of the path are separated by ''/'' and > + * the path is guaranteed to be 0-terminated. the path is only given > + * within the current file system. Therefore, it never starts with a > + * ''/''. the caller is responsible to provide "size" bytes in > + * "dest". the dest buffer will be filled backwards. finally, the > + * start point of the resulting string is returned. this pointer is > + * within dest, normally. in case the path buffer would overflow, the > + * pointer is decremented further as if output was written to the > + * buffer, though no more output is actually generated. that way, the > + * caller can determine how much space would be required for the path > + * to fit into the buffer. in that case, the returned value will be > + * smaller than dest. callers must check this!It would reduce patch sets if you can extend comments in a compatible way, you make reviewers happy if you don''t realign text (or, later, function parameters) where it''s not required.> */ > static char *iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, > - struct btrfs_inode_ref *iref, > - struct extent_buffer *eb_in, u64 parent, > - char *dest, u32 size) > + int name_len, unsigned long name_off,name_len should be u32> + struct extent_buffer *eb_in, u64 parent, > + char *dest, u32 size) > { > - u32 len; > int slot; > u64 next_inum; > int ret; > s64 bytes_left = size - 1; > struct extent_buffer *eb = eb_in; > struct btrfs_key found_key; > + struct btrfs_inode_ref *iref; > + struct btrfs_inode_extref *iref2;iextref> > if (bytes_left >= 0) > dest[bytes_left] = ''\0''; > > while (1) { > - len = btrfs_inode_ref_name_len(eb, iref); > - bytes_left -= len; > + bytes_left -= name_len; > if (bytes_left >= 0) > read_extent_buffer(eb, dest + bytes_left, > - (unsigned long)(iref + 1), len); > + name_off, name_len); > if (eb != eb_in) > free_extent_buffer(eb); > + > + /* Ok, we have enough to find any refs to the parent inode. */ > ret = inode_ref_info(parent, 0, fs_root, path, &found_key); > - if (ret > 0) > - ret = -ENOENT; > - if (ret) > - break; > next_inum = found_key.offset; > + if (ret == 0) { > + slot = path->slots[0]; > + eb = path->nodes[0]; > + /* make sure we can use eb after releasing the path */ > + if (eb != eb_in) > + atomic_inc(&eb->refs); > + btrfs_release_path(path); > + iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); > + > + name_len = btrfs_inode_ref_name_len(eb, iref); > + name_off = (unsigned long)(iref + 1); > + } else { > + ret = btrfs_find_one_extref(fs_root, parent, 0, path, > + &iref2, NULL); > + if (ret) > + break; > + > + next_inum = btrfs_inode_extref_parent(eb, iref2); > + name_off = (unsigned long)&iref2->name; > + name_len = btrfs_inode_extref_name_len(eb, iref2); > + } > > /* regular exit ahead */ > if (parent == next_inum) > break;These regular exit lines must go before the block you inserted. Otherwise we leak a reference on eb if it''s != eb_in. Whereas I think we don''t need this if-else-construct at all. We do need the changes you made as to passing name_len and name_off, I agree. However, the rest of the function should stay as it was, because the parent of each object must be a directory and a directory won''t have hard links. Thus, we''ll never meet INODE_EXTREFs when walking up the path. Or did I miss something?> > - slot = path->slots[0]; > - eb = path->nodes[0]; > - /* make sure we can use eb after releasing the path */ > - if (eb != eb_in) > - atomic_inc(&eb->refs); > - btrfs_release_path(path); > - > - iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); > parent = next_inum; > --bytes_left; > if (bytes_left >= 0) > @@ -1226,9 +1240,9 @@ int iterate_inodes_from_logical(u64 logical, struct btrfs_fs_info *fs_info, > return ret; > } > > -static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > - struct btrfs_path *path, > - iterate_irefs_t *iterate, void *ctx) > +static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root, > + struct btrfs_path *path, > + iterate_irefs_t *iterate, void *ctx)This function must not call free_extent_buffer(eb) in line 1306 after applying your patch set (immediately before the break). Second, I think we''d better add a blocking read lock on eb after incrementing it''s refcount, because we need the current content to stay as it is. Both isn''t part of your patches, but it might be easier if you make that bugfix change as a 3/4 patch within your set and turn this one into 4/4. If you don''t like that, I''ll send a separate patch for it. Don''t miss the unlock if you do it ;-)> { > int ret; > int slot; > @@ -1244,7 +1258,7 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > > while (1) { > ret = inode_ref_info(inum, parent ? parent+1 : 0, fs_root, path, > - &found_key); > + &found_key); > if (ret < 0) > break; > if (ret) { > @@ -1286,6 +1300,76 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > return ret; > } > > +static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root, > + struct btrfs_path *path, > + iterate_extrefs_t *iterate, void *ctx) > +{ > + int ret; > + int slot; > + u64 offset = 0; > + u64 parent; > + int found = 0; > + struct extent_buffer *eb; > + struct btrfs_item *item; > + struct btrfs_inode_extref *iref2;iextref> + > + while (1) { > + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, > + &offset); > + if (ret < 0) > + break; > + if (ret) { > + ret = found ? 0 : -ENOENT; > + break; > + } > + ++found; > + > + slot = path->slots[0]; > + eb = path->nodes[0]; > + /* make sure we can use eb after releasing the path */ > + atomic_inc(&eb->refs);You need a blocking read lock here, too. Grab it before releasing the path.> + btrfs_release_path(path); > + > + item = btrfs_item_nr(eb, slot);You don''t need item.> + iref2 = btrfs_item_ptr(eb, slot, struct btrfs_inode_extref); > + > + parent = btrfs_inode_extref_parent(eb, iref2); > + ret = iterate(parent, iref2, eb, ctx);The caller shouldn''t have to deal with two different types of callbacks. Please just build a dummy struct btrfs_inode_ref object here and pass it to iterate. Alternatively, you can extract the information the caller will need, here, and pass that instead of a struct btrfs_inode_ref. This way, we can use the same type for both iterate() functions.> + if (ret) { > + free_extent_buffer(eb); > + break; > + } > + > + free_extent_buffer(eb);Call free_extent_buffer(eb) before the if (ret), drop it from the if block, add an unlock before the if block.> + offset++;Another caller not expecting btrfs_find_one_extref to return offset incremented. The offset++ should stay here and btrfs_find_one_extref should just return the plain offset.> + } > + > + btrfs_release_path(path); > + > + return ret; > +} > + > +static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > + struct btrfs_path *path, > + iterate_irefs_t *iterate, > + iterate_extrefs_t *iterate2, void *ctx)As mentioned above, I''d like to see only be a single iterate function at this level.> +{ > + int ret, found_refs = 0;split> + > + ret = iterate_inode_refs(inum, fs_root, path, iterate, ctx); > + if (ret && ret != -ENOENT) > + return ret; > + > + if (ret != -ENOENT) > + ++found_refs;I''d make those 2 if statements: if (!ret) ++found_refs; else if (ret != -ENOENT) return ret;> + > + ret = iterate_inode_extrefs(inum, fs_root, path, iterate2, ctx); > + if (ret == -ENOENT && found_refs) > + return 0; > + > + return ret; > +} > + > /* > * returns 0 if the path could be dumped (probably truncated) > * returns <0 in case of an error > @@ -1299,13 +1383,53 @@ static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, > int i = ipath->fspath->elem_cnt; > const int s_ptr = sizeof(char *); > u32 bytes_left; > + u32 name_len = btrfs_inode_ref_name_len(eb, iref); > + > + bytes_left = ipath->fspath->bytes_left > s_ptr ? > + ipath->fspath->bytes_left - s_ptr : 0; > + > + fspath_min = (char *)ipath->fspath->val + (i + 1) * s_ptr; > + fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, name_len, > + (unsigned long)(iref + 1), eb, inum, fspath_min, > + bytes_left); > + if (IS_ERR(fspath)) > + return PTR_ERR(fspath); > + > + if (fspath > fspath_min) { > + ipath->fspath->val[i] = (u64)(unsigned long)fspath; > + ++ipath->fspath->elem_cnt; > + ipath->fspath->bytes_left = fspath - fspath_min; > + } else { > + ++ipath->fspath->elem_missed; > + ipath->fspath->bytes_missing += fspath_min - fspath; > + ipath->fspath->bytes_left = 0; > + } > + > + return 0; > +} > + > +/* > + * returns 0 if the path could be dumped (probably truncated) > + * returns <0 in case of an error > + */ > +static int inode_to_path2(u64 inum, struct btrfs_inode_extref *iref2, > + struct extent_buffer *eb, void *ctx)We''ll get rid of a second inode_to_path, too, if my suggestion works out.> +{ > + struct inode_fs_paths *ipath = ctx; > + char *fspath; > + char *fspath_min; > + int i = ipath->fspath->elem_cnt; > + const int s_ptr = sizeof(char *); > + u32 bytes_left; > + u32 name_len = btrfs_inode_extref_name_len(eb, iref2); > > bytes_left = ipath->fspath->bytes_left > s_ptr ? > ipath->fspath->bytes_left - s_ptr : 0; > > fspath_min = (char *)ipath->fspath->val + (i + 1) * s_ptr; > - fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, iref, eb, > - inum, fspath_min, bytes_left); > + fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, name_len, > + (unsigned long)&iref2->name, eb, inum, > + fspath_min, bytes_left); > if (IS_ERR(fspath)) > return PTR_ERR(fspath); > > @@ -1339,7 +1463,7 @@ static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, > int paths_from_inode(u64 inum, struct inode_fs_paths *ipath) > { > return iterate_irefs(inum, ipath->fs_root, ipath->btrfs_path, > - inode_to_path, ipath); > + inode_to_path, inode_to_path2, ipath); > } > > /* > diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h > index d00dfa9..4634ed7 100644 > --- a/fs/btrfs/backref.h > +++ b/fs/btrfs/backref.h > @@ -31,7 +31,9 @@ struct inode_fs_paths { > typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root, > void *ctx); > typedef int (iterate_irefs_t)(u64 parent, struct btrfs_inode_ref *iref, > - struct extent_buffer *eb, void *ctx); > + struct extent_buffer *eb, void *ctx); > +typedef int (iterate_extrefs_t)(u64 parent, struct btrfs_inode_extref *iref, > + struct extent_buffer *eb, void *ctx); > > int inode_item_info(u64 inum, u64 ioff, struct btrfs_root *fs_root, > struct btrfs_path *path);Thanks! -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12.04.2012 19:59, Jan Schmidt wrote:>> -static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, >> - struct btrfs_path *path, >> - iterate_irefs_t *iterate, void *ctx) >> +static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root, >> + struct btrfs_path *path, >> + iterate_irefs_t *iterate, void *ctx) > > This function must not call free_extent_buffer(eb) in line 1306 after > applying your patch set (immediately before the break). Second, I think > we''d better add a blocking read lock on eb after incrementing it''s > refcount, because we need the current content to stay as it is. Both > isn''t part of your patches, but it might be easier if you make that > bugfix change as a 3/4 patch within your set and turn this one into 4/4. > If you don''t like that, I''ll send a separate patch for it. Don''t miss > the unlock if you do it ;-)FYI: There are more read locks missing in the current version of backref.c. So I made a bugfix patch myself which I''ll test and send tomorrow. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan, firstly thanks very much for the thorough review! I wanted to make sure I got the collision handling going before addressing the issues you found. Again thanks, and my notes are below. On Thu, Apr 12, 2012 at 03:08:27PM +0200, Jan Schmidt wrote:> > +/* > > + * Theoretical limit is larger, but we keep this down to a sane > > + * value. That should limit greatly the possibility of collisions on > > + * inode ref items. > > + */ > > +#define BTRFS_LINK_MAX 65535U > > Do we really want an artificial limit like that?I took a look at other file systems and this seems to be a pretty reasonable maximum. Ext4 and XFS in particular have similar limits. Granted, btrfs should be better equipped to deal with large numbers of hard links - at least in the area of consistency checking. That said, I think keeping this down to "only" 65 thousand hard links per inode should work out fine. Also it''s not a large change to increase or remove the limit should we have to.> > -struct btrfs_inode_ref * > > +static int find_name_in_ext_backref(struct btrfs_path *path, const char *name, > > + int name_len, struct btrfs_inode_extref **ref_ret) > > +{ > > + struct extent_buffer *leaf; > > + struct btrfs_inode_extref *ref; > > + unsigned long ptr; > > + unsigned long name_ptr; > > + u32 item_size; > > + > > + leaf = path->nodes[0]; > > + item_size = btrfs_item_size_nr(leaf, path->slots[0]); > > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > > + ref = (struct btrfs_inode_extref *)ptr; > > + > > + /* > > + * There isn''t actually anything to "search" in an extended > > + * backref - one name is directly attached to each > > + * item. Instead what we''re really doing here is some sort of > > + * sanity check - does the name exist where it should have > > + * been found. If all is well, we''ll return success and the > > + * inode ref object. > > + */ > > In that case, the name is not a good choice. However, if we change how > we deal with hash collisions, the name might be right and the comment > might go away as we really look through more than one item. If we leave > it like this, please rename that function.Yeah, great catch. In the newer version as you''ve noted, it *is* actually a ''find me the name'' function, so the comment (and resulting EROFS) have been removed.> > + name_ptr = (unsigned long)(&ref->name); > > + if (btrfs_inode_extref_name_len(leaf, ref) == name_len > > + && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) { > > + *ref_ret = ref; > > + return 1; > > + } > > + return 0; > > +} > > + > > +/* > > + * Figure the key offset of an extended inode ref > > + */ > > +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len) > > I''d suggest to call this one btrfs_extref_hash and move it to > fs/btrfs/hash.h. That way, we can also drop the include for > <linux/crc32c.h> above.Done.> > +{ > > + return (u64) crc32c(parent_objectid, name, name_len); > > +} > > + > > +static struct btrfs_inode_ref * > > btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, > > - struct btrfs_root *root, > > - struct btrfs_path *path, > > - const char *name, int name_len, > > - u64 inode_objectid, u64 ref_objectid, int mod) > > + struct btrfs_root *root, > > + struct btrfs_path *path, > > + const char *name, int name_len, > > + u64 inode_objectid, u64 ref_objectid, int ins_len, > > + int cow) > > { > > + int ret; > > struct btrfs_key key; > > struct btrfs_inode_ref *ref; > > - int ins_len = mod < 0 ? -1 : 0; > > - int cow = mod != 0; > > - int ret; > > > > key.objectid = inode_objectid; > > key.type = BTRFS_INODE_REF_KEY; > > @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, > > return ref; > > } > > > > -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > > +struct btrfs_inode_extref * > > +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, > > + struct btrfs_root *root, > > + struct btrfs_path *path, > > + const char *name, int name_len, > > + u64 inode_objectid, u64 ref_objectid, int ins_len, > > + int cow) > > +{ > > + int ret; > > + struct btrfs_key key; > > + struct btrfs_inode_extref *ref; > ^^^ > Please use some other identifier than the one we use for struct > btrfs_inode_ref. Suggestion: extref or eref.Yeah I''ll make sure they''re all changed to extref. Makes it much easier to read.> > + > > + key.objectid = inode_objectid; > > + key.type = BTRFS_INODE_EXTREF_KEY; > > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); > ^^^^^^^^^^^^^^^^^^^^ > Get''s clearer when that is called btrfs_extref_hash. > > > + > > + ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow); > > + if (ret < 0) > > + return ERR_PTR(ret); > > + if (ret > 0) > > + return NULL; > > + if (!find_name_in_ext_backref(path, name, name_len, &ref)) > > + return NULL; > > + return ref; > > +} > > + > > +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, > > + struct btrfs_root *root, > > + struct btrfs_path *path, > > + const char *name, int name_len, > > + u64 inode_objectid, u64 ref_objectid, int mod, > > + u64 *ret_index) > > +{ > > + struct btrfs_inode_ref *ref1; > > + struct btrfs_inode_extref *ref2; > ^^^^ > Please, use "ref" and ("eref" or "extref"). > > > + int ins_len = mod < 0 ? -1 : 0; > > + int cow = mod != 0; > > + > > + ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len, > > + inode_objectid, ref_objectid, ins_len, > > + cow); > > + if (IS_ERR(ref1)) > > + return PTR_ERR(ref1); > > + > > + if (ref1 != NULL) { > > + *ret_index = btrfs_inode_ref_index(path->nodes[0], ref1); > > + return 0; > > + } > > + > > + btrfs_release_path(path); > > + > > + ref2 = btrfs_lookup_inode_extref(trans, root, path, name, > > + name_len, inode_objectid, > > + ref_objectid, ins_len, cow); > > + if (IS_ERR(ref2)) > > + return PTR_ERR(ref2); > > + > > + if (ref2) { > > + *ret_index = btrfs_inode_extref_index(path->nodes[0], ref2); > > + return 0; > > + } > > + > > + return -ENOENT; > > +} > > + > > +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans, > > struct btrfs_root *root, > > const char *name, int name_len, > > u64 inode_objectid, u64 ref_objectid, u64 *index) > > { > > struct btrfs_path *path; > > struct btrfs_key key; > > + struct btrfs_inode_extref *ref2; > ^^^^ > > > + struct extent_buffer *leaf; > > + int ret; > > + > > + key.objectid = inode_objectid; > > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); > > + > > + path = btrfs_alloc_path(); > > + if (!path) > > + return -ENOMEM; > > + > > + path->leave_spinning = 1; > > + > > + ret = btrfs_search_slot(trans, root, &key, path, -1, 1); > > + if (ret > 0) { > > + ret = -ENOENT; > > + goto out; > > + } else if (ret < 0) { > > + goto out; > > + } > > Clearer (to me): > > if (ret > 0) > ret = -ENOENT; > if (ret < 0) > goto out;Agreed, changed.> > > + > > + /* > > + * Sanity check - did we find the right item for this name? > > + * This should always succeed so error here will make the FS > > + * readonly. > > + */ > > + if (!find_name_in_ext_backref(path, name, name_len, &ref2)) { > > + btrfs_std_error(root->fs_info, -ENOENT); > > + ret = -EROFS; > > Simply returning -EROFS doesn''t make the fs read-only, does it? I''d > suggest to return -EIO and drop the comment above.Actually, the btrfs_std_error() line above that makes the FS readonly (in which case EROFS is a good return value). Btw, I think that since is this a request to delete the item that not finding it in the tree (even with the possiblity of collision handling) is still a possible corruption.> > + goto out; > > + } > > + > > + leaf = path->nodes[0]; > > + if (index) > > + *index = btrfs_inode_extref_index(leaf, ref2); > > + > > + ret = btrfs_del_item(trans, root, path); > > + > > +out: > > + btrfs_free_path(path); > > + > > + return ret; > > +} > > + > > +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > > + struct btrfs_root *root, > > + const char *name, int name_len, > > + u64 inode_objectid, u64 ref_objectid, u64 *index) > > +{ > > + struct btrfs_path *path; > > + struct btrfs_key key; > > struct btrfs_inode_ref *ref; > > struct extent_buffer *leaf; > > unsigned long ptr; > > unsigned long item_start; > > u32 item_size; > > u32 sub_item_len; > > - int ret; > > + int ret, search_ext_refs = 0; > ^^^^^^^^^^^^^^^^^^^^^ > Documentation/CodingStyle: > -- > To this end, use just one data declaration per line (no commas for > multiple data declarations). > -- > > You do this multiple times in your whole patch set.Sure, I can change that. There''s lots of kernel code that blatantly ignores this but I suppose that''s no excuse...> > int del_len = name_len + sizeof(*ref); > > > > key.objectid = inode_objectid; > > @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > > ret = btrfs_search_slot(trans, root, &key, path, -1, 1); > > if (ret > 0) { > > ret = -ENOENT; > > + search_ext_refs = 1; > > goto out; > > } else if (ret < 0) { > > goto out; > > } > > if (!find_name_in_backref(path, name, name_len, &ref)) { > > ret = -ENOENT; > > + search_ext_refs = 1; > > goto out; > > } > > leaf = path->nodes[0]; > > @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, > > item_size - sub_item_len, 1); > > out: > > btrfs_free_path(path); > > + > > + if (search_ext_refs) { > > As far as I see it, you can drop search_ext_refs and simply go this way > on ret == -ENOENT.I actually made it explicit on purpose. If you look at the function there''s several places where ret can be set and then bubbled down. Instead of depending on the called functions never setting -ENOENT (and future programmers to get that it''s a ''magic'' value), I prefer to just explicitely set a single int variable.> > + /* > ^ > Catched by accident: trailing whitespace.Oh thanks.> > > + * No refs were found, or we could not find the > > + * name in our ref array. Find and remove the extended > > + * inode ref then. > > + */ > > + return btrfs_del_inode_extref(trans, root, name, name_len, > > + inode_objectid, ref_objectid, index); > > + } > > + > > + return ret; > > +} > > + > > +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans, > > + struct btrfs_root *root, > > + const char *name, int name_len, > > + u64 inode_objectid, u64 ref_objectid, u64 index) > > Are we missing a check against BTRFS_LINK_MAX in this function?No, the caller is supposed to check this, in theory before starting a transaction to change the FS. So btrfs_link() does the checking for us.> > +{ > > + struct btrfs_path *path; > > + struct btrfs_key key; > > + struct btrfs_inode_extref *ref; > extref > > > + unsigned long ptr; > > + int ret; > > + int ins_len = name_len + sizeof(*ref); > > + > > + key.objectid = inode_objectid; > > + key.type = BTRFS_INODE_EXTREF_KEY; > > + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); > > + > > + path = btrfs_alloc_path(); > > + if (!path) > > + return -ENOMEM; > > + > > + path->leave_spinning = 1; > > + ret = btrfs_insert_empty_item(trans, root, path, &key, > > + ins_len); > > + if (ret < 0) > > + goto out; > > + > > + ref = btrfs_item_ptr(path->nodes[0], path->slots[0], > > + struct btrfs_inode_extref); > > + > > + btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len); > > + btrfs_set_inode_extref_index(path->nodes[0], ref, index); > > + btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid); > > + > > + ptr = (unsigned long)&ref->name; > > + write_extent_buffer(path->nodes[0], name, ptr, name_len); > > + btrfs_mark_buffer_dirty(path->nodes[0]); > > + > > +out: > > + btrfs_free_path(path); > > return ret; > > } > > > > @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans, > > > > out: > > btrfs_free_path(path); > > + > > + if (ret == -EMLINK) { > > + struct btrfs_super_block *disk_super = root->fs_info->super_copy; > > + /* We ran out of space in the ref array. Need to > > + * add an extended ref. */ > > + if (btrfs_super_incompat_flags(disk_super) > > + & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) > > + ret = btrfs_insert_inode_extref(trans, root, name, > > + name_len, > > + inode_objectid, > > + ref_objectid, index); > > + } > > + > > return ret; > > } > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 892b347..0ca525d 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, > > struct btrfs_trans_handle *trans; > > struct btrfs_root *root = BTRFS_I(dir)->root; > > struct btrfs_path *path; > > - struct btrfs_inode_ref *ref; > > struct btrfs_dir_item *di; > > struct inode *inode = dentry->d_inode; > > u64 index; > > @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, > > } > > btrfs_release_path(path); > > > > - ref = btrfs_lookup_inode_ref(trans, root, path, > > - dentry->d_name.name, dentry->d_name.len, > > - ino, dir_ino, 0); > > - if (IS_ERR(ref)) { > > - err = PTR_ERR(ref); > > + ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name, > > + dentry->d_name.len, ino, dir_ino, 0, &index); > > This line is >80 chars. Please use scripts/checkpatch.pl to catch those.Fixed.> > + if (ret) { > > + err = ret; > > goto out; > > } > > - BUG_ON(!ref); > > + > > if (check_path_shared(root, path)) > > goto out; > > - index = btrfs_inode_ref_index(path->nodes[0], ref); > > + > > btrfs_release_path(path); > > > > /* > > @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, > > btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY); > > key[0].offset = 0; > > > > + /* > > + * Start new inodes with a V1 ref. This is slightly more > > + * efficient for small numbers of hard links. > > + */ > > I''d drop that comment. extref is no "V2" kind of ref. But you can leave > it in if you feel it helps anyone later.Yeah at one point they were called ''v2'' due to developer laziness :) I''ve updated the comment: /* * Start new inodes with an inode_ref. This is slightly more * efficient for small numbers of hard links since they will * be packed into one item. Extended refs will kick in if we * add more hard links than can fit in the ref item. */ It could probably still be improved beyond that. Mostly I want to point out that we never start with extended refs so that anyone looking at the code will understand the ''flow'' of btrfs inode references.> > key[1].objectid = objectid; > > btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY); > > key[1].offset = ref_objectid; > > @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, > > if (root->objectid != BTRFS_I(inode)->root->objectid) > > return -EXDEV; > > > > - if (inode->i_nlink == ~0U) > > + if (inode->i_nlink >= BTRFS_LINK_MAX) > > return -EMLINK; > > > > err = btrfs_set_inode_index(dir, &index); > > Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>Thanks again Jan. I''ll be sure to CC you on my next drop! --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 25.04.2012 00:23, Mark Fasheh wrote:> Jan, firstly thanks very much for the thorough review! I wanted to make sure > I got the collision handling going before addressing the issues you found. > Again thanks, and my notes are below.> On Thu, Apr 12, 2012 at 03:08:27PM +0200, Jan Schmidt wrote: >>> +/* >>> + * Theoretical limit is larger, but we keep this down to a sane >>> + * value. That should limit greatly the possibility of collisions on >>> + * inode ref items. >>> + */ >>> +#define BTRFS_LINK_MAX 65535U >> >> Do we really want an artificial limit like that? > > I took a look at other file systems and this seems to be a pretty reasonable > maximum. Ext4 and XFS in particular have similar limits. > > Granted, btrfs should be better equipped to deal with large numbers of hard > links - at least in the area of consistency checking. > > That said, I think keeping this down to "only" 65 thousand hard links per > inode should work out fine. Also it''s not a large change to increase or > remove the limit should we have to.Agreed.>>> -struct btrfs_inode_ref * >>> +static int find_name_in_ext_backref(struct btrfs_path *path, const char *name, >>> + int name_len, struct btrfs_inode_extref **ref_ret) >>> +{ >>> + struct extent_buffer *leaf; >>> + struct btrfs_inode_extref *ref; >>> + unsigned long ptr; >>> + unsigned long name_ptr; >>> + u32 item_size; >>> + >>> + leaf = path->nodes[0]; >>> + item_size = btrfs_item_size_nr(leaf, path->slots[0]); >>> + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); >>> + ref = (struct btrfs_inode_extref *)ptr; >>> + >>> + /* >>> + * There isn''t actually anything to "search" in an extended >>> + * backref - one name is directly attached to each >>> + * item. Instead what we''re really doing here is some sort of >>> + * sanity check - does the name exist where it should have >>> + * been found. If all is well, we''ll return success and the >>> + * inode ref object. >>> + */ >> >> In that case, the name is not a good choice. However, if we change how >> we deal with hash collisions, the name might be right and the comment >> might go away as we really look through more than one item. If we leave >> it like this, please rename that function. > > Yeah, great catch. In the newer version as you''ve noted, it *is* actually a > ''find me the name'' function, so the comment (and resulting EROFS) have been > removed. > > >>> + name_ptr = (unsigned long)(&ref->name); >>> + if (btrfs_inode_extref_name_len(leaf, ref) == name_len >>> + && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) { >>> + *ref_ret = ref; >>> + return 1; >>> + } >>> + return 0; >>> +} >>> + >>> +/* >>> + * Figure the key offset of an extended inode ref >>> + */ >>> +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len) >> >> I''d suggest to call this one btrfs_extref_hash and move it to >> fs/btrfs/hash.h. That way, we can also drop the include for >> <linux/crc32c.h> above. > > Done. > > >>> +{ >>> + return (u64) crc32c(parent_objectid, name, name_len); >>> +} >>> + >>> +static struct btrfs_inode_ref * >>> btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, >>> - struct btrfs_root *root, >>> - struct btrfs_path *path, >>> - const char *name, int name_len, >>> - u64 inode_objectid, u64 ref_objectid, int mod) >>> + struct btrfs_root *root, >>> + struct btrfs_path *path, >>> + const char *name, int name_len, >>> + u64 inode_objectid, u64 ref_objectid, int ins_len, >>> + int cow) >>> { >>> + int ret; >>> struct btrfs_key key; >>> struct btrfs_inode_ref *ref; >>> - int ins_len = mod < 0 ? -1 : 0; >>> - int cow = mod != 0; >>> - int ret; >>> >>> key.objectid = inode_objectid; >>> key.type = BTRFS_INODE_REF_KEY; >>> @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans, >>> return ref; >>> } >>> >>> -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, >>> +struct btrfs_inode_extref * >>> +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans, >>> + struct btrfs_root *root, >>> + struct btrfs_path *path, >>> + const char *name, int name_len, >>> + u64 inode_objectid, u64 ref_objectid, int ins_len, >>> + int cow) >>> +{ >>> + int ret; >>> + struct btrfs_key key; >>> + struct btrfs_inode_extref *ref; >> ^^^ >> Please use some other identifier than the one we use for struct >> btrfs_inode_ref. Suggestion: extref or eref. > > Yeah I''ll make sure they''re all changed to extref. Makes it much easier to > read. > > >>> + >>> + key.objectid = inode_objectid; >>> + key.type = BTRFS_INODE_EXTREF_KEY; >>> + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); >> ^^^^^^^^^^^^^^^^^^^^ >> Get''s clearer when that is called btrfs_extref_hash. >> >>> + >>> + ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow); >>> + if (ret < 0) >>> + return ERR_PTR(ret); >>> + if (ret > 0) >>> + return NULL; >>> + if (!find_name_in_ext_backref(path, name, name_len, &ref)) >>> + return NULL; >>> + return ref; >>> +} >>> + >>> +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans, >>> + struct btrfs_root *root, >>> + struct btrfs_path *path, >>> + const char *name, int name_len, >>> + u64 inode_objectid, u64 ref_objectid, int mod, >>> + u64 *ret_index) >>> +{ >>> + struct btrfs_inode_ref *ref1; >>> + struct btrfs_inode_extref *ref2; >> ^^^^ >> Please, use "ref" and ("eref" or "extref"). >> >>> + int ins_len = mod < 0 ? -1 : 0; >>> + int cow = mod != 0; >>> + >>> + ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len, >>> + inode_objectid, ref_objectid, ins_len, >>> + cow); >>> + if (IS_ERR(ref1)) >>> + return PTR_ERR(ref1); >>> + >>> + if (ref1 != NULL) { >>> + *ret_index = btrfs_inode_ref_index(path->nodes[0], ref1); >>> + return 0; >>> + } >>> + >>> + btrfs_release_path(path); >>> + >>> + ref2 = btrfs_lookup_inode_extref(trans, root, path, name, >>> + name_len, inode_objectid, >>> + ref_objectid, ins_len, cow); >>> + if (IS_ERR(ref2)) >>> + return PTR_ERR(ref2); >>> + >>> + if (ref2) { >>> + *ret_index = btrfs_inode_extref_index(path->nodes[0], ref2); >>> + return 0; >>> + } >>> + >>> + return -ENOENT; >>> +} >>> + >>> +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans, >>> struct btrfs_root *root, >>> const char *name, int name_len, >>> u64 inode_objectid, u64 ref_objectid, u64 *index) >>> { >>> struct btrfs_path *path; >>> struct btrfs_key key; >>> + struct btrfs_inode_extref *ref2; >> ^^^^ >> >>> + struct extent_buffer *leaf; >>> + int ret; >>> + >>> + key.objectid = inode_objectid; >>> + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); >>> + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); >>> + >>> + path = btrfs_alloc_path(); >>> + if (!path) >>> + return -ENOMEM; >>> + >>> + path->leave_spinning = 1; >>> + >>> + ret = btrfs_search_slot(trans, root, &key, path, -1, 1); >>> + if (ret > 0) { >>> + ret = -ENOENT; >>> + goto out; >>> + } else if (ret < 0) { >>> + goto out; >>> + } >> >> Clearer (to me): >> >> if (ret > 0) >> ret = -ENOENT; >> if (ret < 0) >> goto out; > > Agreed, changed. > >> >>> + >>> + /* >>> + * Sanity check - did we find the right item for this name? >>> + * This should always succeed so error here will make the FS >>> + * readonly. >>> + */ >>> + if (!find_name_in_ext_backref(path, name, name_len, &ref2)) { >>> + btrfs_std_error(root->fs_info, -ENOENT); >>> + ret = -EROFS; >> >> Simply returning -EROFS doesn''t make the fs read-only, does it? I''d >> suggest to return -EIO and drop the comment above. > > Actually, the btrfs_std_error() line above that makes the FS readonly (in > which case EROFS is a good return value).Got it. -EROFS is a good choice, then.> Btw, I think that since is this a request to delete the item that not > finding it in the tree (even with the possiblity of collision handling) is > still a possible corruption. > > >>> + goto out; >>> + } >>> + >>> + leaf = path->nodes[0]; >>> + if (index) >>> + *index = btrfs_inode_extref_index(leaf, ref2); >>> + >>> + ret = btrfs_del_item(trans, root, path); >>> + >>> +out: >>> + btrfs_free_path(path); >>> + >>> + return ret; >>> +} >>> + >>> +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, >>> + struct btrfs_root *root, >>> + const char *name, int name_len, >>> + u64 inode_objectid, u64 ref_objectid, u64 *index) >>> +{ >>> + struct btrfs_path *path; >>> + struct btrfs_key key; >>> struct btrfs_inode_ref *ref; >>> struct extent_buffer *leaf; >>> unsigned long ptr; >>> unsigned long item_start; >>> u32 item_size; >>> u32 sub_item_len; >>> - int ret; >>> + int ret, search_ext_refs = 0; >> ^^^^^^^^^^^^^^^^^^^^^ >> Documentation/CodingStyle: >> -- >> To this end, use just one data declaration per line (no commas for >> multiple data declarations). >> -- >> >> You do this multiple times in your whole patch set. > > Sure, I can change that. There''s lots of kernel code that blatantly ignores > this but I suppose that''s no excuse... > > >>> int del_len = name_len + sizeof(*ref); >>> >>> key.objectid = inode_objectid; >>> @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, >>> ret = btrfs_search_slot(trans, root, &key, path, -1, 1); >>> if (ret > 0) { >>> ret = -ENOENT; >>> + search_ext_refs = 1; >>> goto out; >>> } else if (ret < 0) { >>> goto out; >>> } >>> if (!find_name_in_backref(path, name, name_len, &ref)) { >>> ret = -ENOENT; >>> + search_ext_refs = 1; >>> goto out; >>> } >>> leaf = path->nodes[0]; >>> @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans, >>> item_size - sub_item_len, 1); >>> out: >>> btrfs_free_path(path); >>> + >>> + if (search_ext_refs) { >> >> As far as I see it, you can drop search_ext_refs and simply go this way >> on ret == -ENOENT. > > I actually made it explicit on purpose. If you look at the function there''s > several places where ret can be set and then bubbled down. Instead of > depending on the called functions never setting -ENOENT (and future > programmers to get that it''s a ''magic'' value), I prefer to just explicitely > set a single int variable.Okay, sounds reasonable.>>> + /* >> ^ >> Catched by accident: trailing whitespace. > > Oh thanks. > >> >>> + * No refs were found, or we could not find the >>> + * name in our ref array. Find and remove the extended >>> + * inode ref then. >>> + */ >>> + return btrfs_del_inode_extref(trans, root, name, name_len, >>> + inode_objectid, ref_objectid, index); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans, >>> + struct btrfs_root *root, >>> + const char *name, int name_len, >>> + u64 inode_objectid, u64 ref_objectid, u64 index) >> >> Are we missing a check against BTRFS_LINK_MAX in this function? > > No, the caller is supposed to check this, in theory before starting a > transaction to change the FS. So btrfs_link() does the checking for us.Is that worth a comment? Or is that clear to anybody familiar with such things anyway (in which case I''d not add that comment)?>>> +{ >>> + struct btrfs_path *path; >>> + struct btrfs_key key; >>> + struct btrfs_inode_extref *ref; >> extref >> >>> + unsigned long ptr; >>> + int ret; >>> + int ins_len = name_len + sizeof(*ref); >>> + >>> + key.objectid = inode_objectid; >>> + key.type = BTRFS_INODE_EXTREF_KEY; >>> + key.offset = btrfs_extref_key_off(ref_objectid, name, name_len); >>> + >>> + path = btrfs_alloc_path(); >>> + if (!path) >>> + return -ENOMEM; >>> + >>> + path->leave_spinning = 1; >>> + ret = btrfs_insert_empty_item(trans, root, path, &key, >>> + ins_len); >>> + if (ret < 0) >>> + goto out; >>> + >>> + ref = btrfs_item_ptr(path->nodes[0], path->slots[0], >>> + struct btrfs_inode_extref); >>> + >>> + btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len); >>> + btrfs_set_inode_extref_index(path->nodes[0], ref, index); >>> + btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid); >>> + >>> + ptr = (unsigned long)&ref->name; >>> + write_extent_buffer(path->nodes[0], name, ptr, name_len); >>> + btrfs_mark_buffer_dirty(path->nodes[0]); >>> + >>> +out: >>> + btrfs_free_path(path); >>> return ret; >>> } >>> >>> @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans, >>> >>> out: >>> btrfs_free_path(path); >>> + >>> + if (ret == -EMLINK) { >>> + struct btrfs_super_block *disk_super = root->fs_info->super_copy; >>> + /* We ran out of space in the ref array. Need to >>> + * add an extended ref. */ >>> + if (btrfs_super_incompat_flags(disk_super) >>> + & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF) >>> + ret = btrfs_insert_inode_extref(trans, root, name, >>> + name_len, >>> + inode_objectid, >>> + ref_objectid, index); >>> + } >>> + >>> return ret; >>> } >>> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 892b347..0ca525d 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, >>> struct btrfs_trans_handle *trans; >>> struct btrfs_root *root = BTRFS_I(dir)->root; >>> struct btrfs_path *path; >>> - struct btrfs_inode_ref *ref; >>> struct btrfs_dir_item *di; >>> struct inode *inode = dentry->d_inode; >>> u64 index; >>> @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir, >>> } >>> btrfs_release_path(path); >>> >>> - ref = btrfs_lookup_inode_ref(trans, root, path, >>> - dentry->d_name.name, dentry->d_name.len, >>> - ino, dir_ino, 0); >>> - if (IS_ERR(ref)) { >>> - err = PTR_ERR(ref); >>> + ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name, >>> + dentry->d_name.len, ino, dir_ino, 0, &index); >> >> This line is >80 chars. Please use scripts/checkpatch.pl to catch those. > > Fixed. > > >>> + if (ret) { >>> + err = ret; >>> goto out; >>> } >>> - BUG_ON(!ref); >>> + >>> if (check_path_shared(root, path)) >>> goto out; >>> - index = btrfs_inode_ref_index(path->nodes[0], ref); >>> + >>> btrfs_release_path(path); >>> >>> /* >>> @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, >>> btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY); >>> key[0].offset = 0; >>> >>> + /* >>> + * Start new inodes with a V1 ref. This is slightly more >>> + * efficient for small numbers of hard links. >>> + */ >> >> I''d drop that comment. extref is no "V2" kind of ref. But you can leave >> it in if you feel it helps anyone later. > > Yeah at one point they were called ''v2'' due to developer laziness :) > > I''ve updated the comment: > > /* > * Start new inodes with an inode_ref. This is slightly more > * efficient for small numbers of hard links since they will > * be packed into one item. Extended refs will kick in if we > * add more hard links than can fit in the ref item. > */ > > > It could probably still be improved beyond that. Mostly I want to point out > that we never start with extended refs so that anyone looking at the code > will understand the ''flow'' of btrfs inode references.I think its fine like it stands above. -Jan>>> key[1].objectid = objectid; >>> btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY); >>> key[1].offset = ref_objectid; >>> @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, >>> if (root->objectid != BTRFS_I(inode)->root->objectid) >>> return -EXDEV; >>> >>> - if (inode->i_nlink == ~0U) >>> + if (inode->i_nlink >= BTRFS_LINK_MAX) >>> return -EMLINK; >>> >>> err = btrfs_set_inode_index(dir, &index); >> >> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> > > Thanks again Jan. I''ll be sure to CC you on my next drop! > --Mark > > -- > Mark Fasheh-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 05:53:15PM +0200, Jan Schmidt wrote:> Hi Mark, > > While reading 3/3 I stumbled across one more thing in this one: > > On 05.04.2012 22:09, Mark Fasheh wrote: > > +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, > > + u64 start_off, struct btrfs_path *path, > > + struct btrfs_inode_extref **ret_ref, u64 *found_off) > > +{ > > + int ret, slot; > > + struct btrfs_key key, found_key; > > + struct btrfs_inode_extref *ref; > > + struct extent_buffer *leaf; > > + struct btrfs_item *item; > > + unsigned long ptr; > > > > -/* > > - * There are a few corners where the link count of the file can''t > > - * be properly maintained during replay. So, instead of adding > > - * lots of complexity to the log code, we just scan the backrefs > > - * for any file that has been through replay. > > - * > > - * The scan will update the link count on the inode to reflect the > > - * number of back refs found. If it goes down to zero, the iput > > - * will free the inode. > > - */ > > -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > > - struct btrfs_root *root, > > - struct inode *inode) > > + key.objectid = inode_objectid; > > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > > + key.offset = start_off; > > + > > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > > + if (ret < 0) > > + goto out; > > + > > + while (1) { > > + leaf = path->nodes[0]; > > + slot = path->slots[0]; > > + if (slot >= btrfs_header_nritems(leaf)) { > > + /* > > + * If the item at offset is not found, > > + * btrfs_search_slot will point us to the slot > > + * where it should be inserted. In our case > > + * that will be the slot directly before the > > + * next INODE_REF_KEY_V2 item. In the case > > + * that we''re pointing to the last slot in a > > + * leaf, we must move one leaf over. > > + */ > > + ret = btrfs_next_leaf(root, path); > > + if (ret) { > > + if (ret >= 1) > > + ret = -ENOENT; > > + break; > > + } > > + continue; > > + } > > + > > + item = btrfs_item_nr(leaf, slot); > > + btrfs_item_key_to_cpu(leaf, &found_key, slot); > > + > > + /* > > + * Check that we''re still looking at an extended ref key for > > + * this particular objectid. If we have different > > + * objectid or type then there are no more to be found > > + * in the tree and we can exit. > > + */ > > + ret = -ENOENT; > > + if (found_key.objectid != inode_objectid) > > + break; > > + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY) > > + break; > > + > > + ret = 0; > > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > > + ref = (struct btrfs_inode_extref *)ptr; > > + *ret_ref = ref; > > + if (found_off) > > + *found_off = found_key.offset + 1; > ^^^ > It''s evil to call it "found offset" an then return one larger than the > offset found. No caller would ever expect this. > > > + break; > > + } > > + > > +out: > > + return ret; > > +} > > + > > +static int count_inode_extrefs(struct btrfs_root *root, > > + struct inode *inode, struct btrfs_path *path) > > +{ > > + int ret; > > + unsigned int nlink = 0; > > + u64 inode_objectid = btrfs_ino(inode); > > + u64 offset = 0; > > + struct btrfs_inode_extref *ref; > > + > > + while (1) { > > + ret = btrfs_find_one_extref(root, inode_objectid, offset, path, > > + &ref, &offset); > > + if (ret) > > + break; > > + > > + nlink++; > > + offset++; > ^^^^^^^^ > Huh. See? The caller expected to get the offset found from > btrfs_find_one_extref. As it stands you might be missing the very next key.Oh yeah that was totally broken btw. Fixed now, but forgot to mention that :) btrfs_find_one_extref() now returns the actual found offset (and the callers can just increment if they''re doing a search). --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 03:08:35PM +0200, Jan Schmidt wrote:> On 05.04.2012 22:09, Mark Fasheh wrote: > > Teach tree-log.c about extended inode refs. In particular, we have to adjust > > the behavior of inode ref replay as well as log tree recovery to account for > > the existence of extended refs. > > > > Signed-off-by: Mark Fasheh <mfasheh@suse.de> > > --- > > fs/btrfs/tree-log.c | 320 +++++++++++++++++++++++++++++++++++++++++--------- > > fs/btrfs/tree-log.h | 4 + > > 2 files changed, 266 insertions(+), 58 deletions(-) > > > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > > index 966cc74..d69b07a 100644 > > --- a/fs/btrfs/tree-log.c > > +++ b/fs/btrfs/tree-log.c > > @@ -23,6 +23,7 @@ > > #include "disk-io.h" > > #include "locking.h" > > #include "print-tree.h" > > +#include "backref.h" > > So now tree-log.c includes backref.h and backref.c includes tree-log.h, > this is not a problem by itself, but I''d try to avoid such dependencies. > I''d put find_one_extref over to backref.c to solve this, also because > there it would be closer to inode_ref_info (which does the same for > INODE_REFs.Yeah good idea. I went ahead and moved find_one_extref to backref.c as you suggest.> > @@ -786,7 +804,6 @@ out: > > return match; > > } > > > > - > > /* > > * replay one inode back reference item found in the log tree. > > * eb, slot and key refer to the buffer and key found in the log tree. > > @@ -801,15 +818,20 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, > > struct btrfs_key *key) > > { > > struct btrfs_inode_ref *ref; > > + struct btrfs_inode_extref *extref; > > struct btrfs_dir_item *di; > > + struct btrfs_key search_key; > > You don''t need search_key, just use key from the parameter list as the > code did previously.I force the value of search key (to look for refs) however while the key value from the parameter list could be of any kind of inode ref.> > struct inode *dir; > > struct inode *inode; > > unsigned long ref_ptr; > > unsigned long ref_end; > > - char *name; > > - int namelen; > > + char *name, *victim_name; > > + int namelen, victim_name_len; > > split > > > int ret; > > int search_done = 0; > > + int log_ref_ver = 0; > > + u64 parent_objectid, inode_objectid, ref_index; > > splitdone, thanks.> > + struct extent_buffer *leaf; > > > > /* > > * it is possible that we didn''t log all the parent directories > > @@ -817,32 +839,56 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, > > * copy the back ref in. The link count fixup code will take > > * care of the rest > > */ > > - dir = read_one_inode(root, key->offset); > > + > > + if (key->type == BTRFS_INODE_EXTREF_KEY) { > > + log_ref_ver = 1; > ^^^^^^^^^^^^^^^ > Assigned but never used.Almost got rid of this but it turns out after some changes, I''m using it now :)> > + > > + ref_ptr = btrfs_item_ptr_offset(eb, slot); > > + > > + /* So that we don''t loop back looking for old style log refs. */ > > + ref_end = ref_ptr; > > + > > + extref = (struct btrfs_inode_extref *) btrfs_item_ptr_offset(eb, slot); > > + namelen = btrfs_inode_extref_name_len(eb, extref); > > + name = kmalloc(namelen, GFP_NOFS); > > kmalloc may fail.Fixed both instances of this. I''m just testing for null return from kmalloc and bubbling the -ENOMEM back up. The callers of add_inode_ref() will wind up BUGing on us anyway but that''s beyond the scope of this patch.> > + > > + read_extent_buffer(eb, name, (unsigned long)&extref->name, > > + namelen); > > + > > + ref_index = btrfs_inode_extref_index(eb, extref); > > + parent_objectid = btrfs_inode_extref_parent(eb, extref); > > + } else { > > + parent_objectid = key->offset; > > + > > + ref_ptr = btrfs_item_ptr_offset(eb, slot); > > + ref_end = ref_ptr + btrfs_item_size_nr(eb, slot); > > + > > + ref = (struct btrfs_inode_ref *)ref_ptr; > > + namelen = btrfs_inode_ref_name_len(eb, ref); > > + name = kmalloc(namelen, GFP_NOFS); > > + BUG_ON(!name); > > + > > + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); > > + > > + ref_index = btrfs_inode_ref_index(eb, ref); > > The way you put it, you had to copy all the code from the else block > down before the "goto again" case. Code duplication is error prone > > Maybe you can manage to put everything between "again:" and "goto again" > into a separate function that can be called multiple times. For the old > INODE_REF items that could be done in a loop then. This would also split > an overly long function into more readable units.Hmm, ok I pushed the variable setting code into a function for each type of ref. That cleans things up and keeps the code centralized.> > + /* Same search but for extended refs */ > > + extref = btrfs_lookup_inode_extref(NULL, root, path, name, namelen, > > + inode_objectid, parent_objectid, 0, > > + 0); > > + if (extref && !IS_ERR(extref)) { > > if (!IS_ERR_OR_NULL(extref))Nice, thanks for the tip.> > + victim_name_len = btrfs_inode_extref_name_len(eb, extref); > > + victim_name = kmalloc(namelen, GFP_NOFS); > > + leaf = path->nodes[0]; > > + read_extent_buffer(eb, name, (unsigned long)&extref->name, namelen); > > + > > + search_key.objectid = inode_objectid; > > + search_key.type = BTRFS_INODE_EXTREF_KEY; > > + search_key.offset = btrfs_extref_key_off(parent_objectid, name, namelen); > > + if (!backref_in_log(log, &search_key, victim_name, > > + victim_name_len)) { > > + btrfs_inc_nlink(inode); > > + btrfs_release_path(path); > > + > > + ret = btrfs_unlink_inode(trans, root, dir, > > + inode, victim_name, > > + victim_name_len); > > + } > > + kfree(victim_name); > > + BUG_ON(ret); > > + } > > + > > + /* > > + * NOTE: we have searched root tree and checked the > > + * coresponding refs, it does not need to be checked again. > > + */ > > + search_done = 1; > > You moved this comment and assignment out of the "if (ret == 0)" case. > I''m not sure if this is still doing exactly the same now (?). > Previously, we were executing another btrfs_search_slot, > btrfs_lookup_dir_index_item, ... after the "goto again" case, which > would be skipped with this patch.Hmm, ok you''re definitely right that the search_done line there is broken. Come to think of it, I''m not quite sure what the meaning of that tiny bit of code was. I''ll come back to this one once I''ve looked closer.> > + > > /* look for a conflicting sequence number */ > > di = btrfs_lookup_dir_index_item(trans, root, path, btrfs_ino(dir), > > - btrfs_inode_ref_index(eb, ref), > > - name, namelen, 0); > > + ref_index, name, namelen, 0); > > if (di && !IS_ERR(di)) { > > ret = drop_one_dir_item(trans, root, path, dir, di); > > BUG_ON(ret); > > @@ -932,17 +1007,25 @@ again: > > > > insert: > > /* insert our name */ > > - ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, > > - btrfs_inode_ref_index(eb, ref)); > > + ret = btrfs_add_link(trans, dir, inode, name, namelen, 0, ref_index); > > BUG_ON(ret); > > > > btrfs_update_inode(trans, root, inode); > > > > out: > > - ref_ptr = (unsigned long)(ref + 1) + namelen; > > + ref_ptr = (unsigned long)(ref_ptr + sizeof(struct btrfs_inode_ref)) + namelen; > > kfree(name); > > - if (ref_ptr < ref_end) > > + if (ref_ptr < ref_end) { > > + ref = (struct btrfs_inode_ref *)ref_ptr; > > + namelen = btrfs_inode_ref_name_len(eb, ref); > > + name = kmalloc(namelen, GFP_NOFS); > > + BUG_ON(!name); > > + > > + read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen); > > + > > + ref_index = btrfs_inode_ref_index(eb, ref); > > This is the duplicated code mentioned above. > > > goto again; > > + } > > > > /* finally write the back reference in the inode */ > > ret = overwrite_item(trans, root, path, eb, slot, key); > > @@ -965,25 +1048,103 @@ static int insert_orphan_item(struct btrfs_trans_handle *trans, > > return ret; > > } > > > > +int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, > > + u64 start_off, struct btrfs_path *path, > > + struct btrfs_inode_extref **ret_ref, u64 *found_off) > > ret_extreffixed.> > > +{ > > + int ret, slot; > > + struct btrfs_key key, found_key; > > splitfixed.> > + struct btrfs_inode_extref *ref; > > + struct extent_buffer *leaf; > > + struct btrfs_item *item; > > + unsigned long ptr; > > > > -/* > > - * There are a few corners where the link count of the file can''t > > - * be properly maintained during replay. So, instead of adding > > - * lots of complexity to the log code, we just scan the backrefs > > - * for any file that has been through replay. > > - * > > - * The scan will update the link count on the inode to reflect the > > - * number of back refs found. If it goes down to zero, the iput > > - * will free the inode. > > - */ > > -static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > > - struct btrfs_root *root, > > - struct inode *inode) > > [side note: this is really strange patch alignment. my git is doing it > the same way.] > > > + key.objectid = inode_objectid; > > + btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY); > > + key.offset = start_off; > > + > > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > > + if (ret < 0) > > + goto out; > > Just return here and drop the out label.Done.> > > + > > + while (1) { > > + leaf = path->nodes[0]; > > + slot = path->slots[0]; > > + if (slot >= btrfs_header_nritems(leaf)) { > > + /* > > + * If the item at offset is not found, > > + * btrfs_search_slot will point us to the slot > > + * where it should be inserted. In our case > > + * that will be the slot directly before the > > + * next INODE_REF_KEY_V2 item. In the case > > + * that we''re pointing to the last slot in a > > + * leaf, we must move one leaf over. > > + */ > > + ret = btrfs_next_leaf(root, path); > > + if (ret) { > > + if (ret >= 1) > > + ret = -ENOENT; > > + break; > > + } > > This large block (and its comment) can be replaced by > btrfs_search_slot_for_read with find_higher = 1, return_any = 0. This > also avoids the "continue" and we even git rid of the whole while (1) > loop here.Ok. I don''t have that in my tree but presumably it''s a very new function I just need to pick up with an update...> > + continue; > > + } > > + > > + item = btrfs_item_nr(leaf, slot); > > + btrfs_item_key_to_cpu(leaf, &found_key, slot); > > + > > + /* > > + * Check that we''re still looking at an extended ref key for > > + * this particular objectid. If we have different > > + * objectid or type then there are no more to be found > > + * in the tree and we can exit. > > + */ > > + ret = -ENOENT; > > + if (found_key.objectid != inode_objectid) > > + break; > > + if (btrfs_key_type(&found_key) != BTRFS_INODE_EXTREF_KEY) > > + break; > > + > > + ret = 0; > > + ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); > > + ref = (struct btrfs_inode_extref *)ptr; > > + *ret_ref = ref; > > + if (found_off) > > + *found_off = found_key.offset + 1; > > + break; > > + } > > + > > +out: > > + return ret; > > +} > > + > > +static int count_inode_extrefs(struct btrfs_root *root, > > + struct inode *inode, struct btrfs_path *path) > > +{ > > + int ret; > > + unsigned int nlink = 0; > > + u64 inode_objectid = btrfs_ino(inode); > > + u64 offset = 0; > > + struct btrfs_inode_extref *ref; > > + > > + while (1) { > > + ret = btrfs_find_one_extref(root, inode_objectid, offset, path, > > + &ref, &offset); > > + if (ret) > > + break; > > Assume the first call to btrfs_find_ione_extref returns -EIO. Do we > really want count_inode_extrefs return 0 here? I agree that the previous > code suffers from the same problem, but still: it''s a problem.Yeah as you note, I''m just keeping the same behavior as before. This I think is probably a question for Chris...> > + > > + nlink++; > > + offset++; > > + } > > + > > + return nlink; > > +} > > + > > +static int count_inode_refs(struct btrfs_root *root, > > + struct inode *inode, struct btrfs_path *path) > > { > > - struct btrfs_path *path; > > int ret; > > struct btrfs_key key; > > - u64 nlink = 0; > > + unsigned int nlink = 0; > > unsigned long ptr; > > unsigned long ptr_end; > > int name_len; > > @@ -993,10 +1154,6 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > > key.type = BTRFS_INODE_REF_KEY; > > key.offset = (u64)-1; > > > > - path = btrfs_alloc_path(); > > - if (!path) > > - return -ENOMEM; > > - > > while (1) { > > ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > > if (ret < 0) > > @@ -1030,6 +1187,48 @@ static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > > btrfs_release_path(path); > > } > > btrfs_release_path(path); > > + btrfs_free_path(path); > > In general, you must not free a path when you don''t allocate it.Ok yeah that one must''ve gotten past me. Fixed.> > > + > > + return nlink; > > +} > > + > > +/* > > + * There are a few corners where the link count of the file can''t > > + * be properly maintained during replay. So, instead of adding > > + * lots of complexity to the log code, we just scan the backrefs > > + * for any file that has been through replay. > > + * > > + * The scan will update the link count on the inode to reflect the > > + * number of back refs found. If it goes down to zero, the iput > > + * will free the inode. > > + */ > > +static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans, > > + struct btrfs_root *root, > > + struct inode *inode) > > +{ > > + struct btrfs_path *path; > > + int ret; > > + u64 nlink = 0; > > + u64 ino = btrfs_ino(inode); > > + > > + path = btrfs_alloc_path(); > > + if (!path) > > + return -ENOMEM; > > + > > + ret = count_inode_refs(root, inode, path); > > + btrfs_release_path(path); > > Either count_inode_refs should alloc it''s private path, or it should > return the path in a released state. The caller should not be > responsible to pass a clean path and release it afterwards, as > count_inode_refs does not return anything through the path. > > > + if (ret < 0) > > + goto out; > > + > > + nlink = ret; > > + > > + ret = count_inode_extrefs(root, inode, path); > > + btrfs_release_path(path); > > Same here. I''d just put the btrfs_release_path into count_inode_(ext)refs.Yeah, fixed those up. Thanks again Jan! --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 03, 2012 at 04:12:21PM -0700, Mark Fasheh wrote:> > > + > > > + ref_ptr = btrfs_item_ptr_offset(eb, slot); > > > + > > > + /* So that we don''t loop back looking for old style log refs. */ > > > + ref_end = ref_ptr; > > > + > > > + extref = (struct btrfs_inode_extref *) btrfs_item_ptr_offset(eb, slot); > > > + namelen = btrfs_inode_extref_name_len(eb, extref); > > > + name = kmalloc(namelen, GFP_NOFS); > > > > kmalloc may fail. > > Fixed both instances of this. I''m just testing for null return from kmalloc > and bubbling the -ENOMEM back up. The callers of add_inode_ref() will wind > up BUGing on us anyway but that''s beyond the scope of this patch.Yes, this is consistent with the rest of no-mem handling. Fixing all caller paths is not always trivial and one does not want to do it during a patch development. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan, comments inline as usual! On Thu, Apr 12, 2012 at 07:59:36PM +0200, Jan Schmidt wrote:> > @@ -858,62 +859,75 @@ static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root, > > } > > > > /* > > - * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements > > - * of the path are separated by ''/'' and the path is guaranteed to be > > - * 0-terminated. the path is only given within the current file system. > > - * Therefore, it never starts with a ''/''. the caller is responsible to provide > > - * "size" bytes in "dest". the dest buffer will be filled backwards. finally, > > - * the start point of the resulting string is returned. this pointer is within > > - * dest, normally. > > - * in case the path buffer would overflow, the pointer is decremented further > > - * as if output was written to the buffer, though no more output is actually > > - * generated. that way, the caller can determine how much space would be > > - * required for the path to fit into the buffer. in that case, the returned > > - * value will be smaller than dest. callers must check this! > > + * Given the parent objectid and name/name_len pairs of an inode ref > > + * (any version) this iterates to turn that information into a > > + * full filesystem path. elements of the path are separated by ''/'' and > > + * the path is guaranteed to be 0-terminated. the path is only given > > + * within the current file system. Therefore, it never starts with a > > + * ''/''. the caller is responsible to provide "size" bytes in > > + * "dest". the dest buffer will be filled backwards. finally, the > > + * start point of the resulting string is returned. this pointer is > > + * within dest, normally. in case the path buffer would overflow, the > > + * pointer is decremented further as if output was written to the > > + * buffer, though no more output is actually generated. that way, the > > + * caller can determine how much space would be required for the path > > + * to fit into the buffer. in that case, the returned value will be > > + * smaller than dest. callers must check this! > > It would reduce patch sets if you can extend comments in a compatible > way, you make reviewers happy if you don''t realign text (or, later, > function parameters) where it''s not required.Yeah I just reverted the comment change as it''s no longer needed anyway.> > */ > > static char *iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, > > - struct btrfs_inode_ref *iref, > > - struct extent_buffer *eb_in, u64 parent, > > - char *dest, u32 size) > > + int name_len, unsigned long name_off, > > name_len should be u32hmm ok that''s fine.> > + struct extent_buffer *eb_in, u64 parent, > > + char *dest, u32 size) > > { > > - u32 len; > > int slot; > > u64 next_inum; > > int ret; > > s64 bytes_left = size - 1; > > struct extent_buffer *eb = eb_in; > > struct btrfs_key found_key; > > + struct btrfs_inode_ref *iref; > > + struct btrfs_inode_extref *iref2; > > iextrefdone.> > if (bytes_left >= 0) > > dest[bytes_left] = ''\0''; > > > > while (1) { > > - len = btrfs_inode_ref_name_len(eb, iref); > > - bytes_left -= len; > > + bytes_left -= name_len; > > if (bytes_left >= 0) > > read_extent_buffer(eb, dest + bytes_left, > > - (unsigned long)(iref + 1), len); > > + name_off, name_len); > > if (eb != eb_in) > > free_extent_buffer(eb); > > + > > + /* Ok, we have enough to find any refs to the parent inode. */ > > ret = inode_ref_info(parent, 0, fs_root, path, &found_key); > > - if (ret > 0) > > - ret = -ENOENT; > > - if (ret) > > - break; > > next_inum = found_key.offset; > > + if (ret == 0) { > > + slot = path->slots[0]; > > + eb = path->nodes[0]; > > + /* make sure we can use eb after releasing the path */ > > + if (eb != eb_in) > > + atomic_inc(&eb->refs); > > + btrfs_release_path(path); > > + iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); > > + > > + name_len = btrfs_inode_ref_name_len(eb, iref); > > + name_off = (unsigned long)(iref + 1); > > + } else { > > + ret = btrfs_find_one_extref(fs_root, parent, 0, path, > > + &iref2, NULL); > > + if (ret) > > + break; > > + > > + next_inum = btrfs_inode_extref_parent(eb, iref2); > > + name_off = (unsigned long)&iref2->name; > > + name_len = btrfs_inode_extref_name_len(eb, iref2); > > + } > > > > /* regular exit ahead */ > > if (parent == next_inum) > > break; > > These regular exit lines must go before the block you inserted. > Otherwise we leak a reference on eb if it''s != eb_in.Good catch, fixed by virtue of fixing the below issue :)> Whereas I think we don''t need this if-else-construct at all. We do need > the changes you made as to passing name_len and name_off, I agree. > However, the rest of the function should stay as it was, because the > parent of each object must be a directory and a directory won''t have > hard links. Thus, we''ll never meet INODE_EXTREFs when walking up the > path. Or did I miss something?You didn''t miss anything, at the time I coded this part I wasn''t 100% sure if we would use inode refs on a directory for something *other* than hardlinking a dir (like, possibly some internal use I wasn''t aware of). I''m pretty confident now that this can''t happen so I think the best approach is to just kill the code that''s handling this condition and leave a nice comment.> > > > - slot = path->slots[0]; > > - eb = path->nodes[0]; > > - /* make sure we can use eb after releasing the path */ > > - if (eb != eb_in) > > - atomic_inc(&eb->refs); > > - btrfs_release_path(path); > > - > > - iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); > > parent = next_inum; > > --bytes_left; > > if (bytes_left >= 0) > > @@ -1226,9 +1240,9 @@ int iterate_inodes_from_logical(u64 logical, struct btrfs_fs_info *fs_info, > > return ret; > > } > > > > -static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > > - struct btrfs_path *path, > > - iterate_irefs_t *iterate, void *ctx) > > +static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root, > > + struct btrfs_path *path, > > + iterate_irefs_t *iterate, void *ctx) > > This function must not call free_extent_buffer(eb) in line 1306 after > applying your patch set (immediately before the break). Second, I think > we''d better add a blocking read lock on eb after incrementing it''s > refcount, because we need the current content to stay as it is. Both > isn''t part of your patches, but it might be easier if you make that > bugfix change as a 3/4 patch within your set and turn this one into 4/4. > If you don''t like that, I''ll send a separate patch for it. Don''t miss > the unlock if you do it ;-)Ok, I think I was able to figure out and add the correct locking calls. Basically I believe I need to wrap access around: btrfs_tree_read_lock(eb); btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); <read eb contents> btrfs_tree_read_unlock_blocking(eb);> > { > > int ret; > > int slot; > > @@ -1244,7 +1258,7 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > > > > while (1) { > > ret = inode_ref_info(inum, parent ? parent+1 : 0, fs_root, path, > > - &found_key); > > + &found_key); > > if (ret < 0) > > break; > > if (ret) { > > @@ -1286,6 +1300,76 @@ static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > > return ret; > > } > > > > +static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root, > > + struct btrfs_path *path, > > + iterate_extrefs_t *iterate, void *ctx) > > +{ > > + int ret; > > + int slot; > > + u64 offset = 0; > > + u64 parent; > > + int found = 0; > > + struct extent_buffer *eb; > > + struct btrfs_item *item; > > + struct btrfs_inode_extref *iref2; > > iextrefDone.> > + > > + while (1) { > > + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, > > + &offset); > > + if (ret < 0) > > + break; > > + if (ret) { > > + ret = found ? 0 : -ENOENT; > > + break; > > + } > > + ++found; > > + > > + slot = path->slots[0]; > > + eb = path->nodes[0]; > > + /* make sure we can use eb after releasing the path */ > > + atomic_inc(&eb->refs); > > You need a blocking read lock here, too. Grab it before releasing the path.Done.> > > + btrfs_release_path(path); > > + > > + item = btrfs_item_nr(eb, slot); > > You don''t need item.Removed.> > > + iref2 = btrfs_item_ptr(eb, slot, struct btrfs_inode_extref); > > + > > + parent = btrfs_inode_extref_parent(eb, iref2); > > + ret = iterate(parent, iref2, eb, ctx); > > The caller shouldn''t have to deal with two different types of callbacks. > Please just build a dummy struct btrfs_inode_ref object here and pass it > to iterate. > Alternatively, you can extract the information the caller > will need, here, and pass that instead of a struct btrfs_inode_ref. This > way, we can use the same type for both iterate() functions.Ask and you shall receive :) It turns out to be pretty trivial to just pass in name_len and the name pointer as that''s all those callbacks needed from the ref.> > + if (ret) { > > + free_extent_buffer(eb); > > + break; > > + } > > + > > + free_extent_buffer(eb); > > Call free_extent_buffer(eb) before the if (ret), drop it from the if > block, add an unlock before the if block.Fixed that.> > + offset++; > > Another caller not expecting btrfs_find_one_extref to return offset > incremented. The offset++ should stay here and btrfs_find_one_extref > should just return the plain offset. > > > + } > > + > > + btrfs_release_path(path); > > + > > + return ret; > > +} > > + > > +static int iterate_irefs(u64 inum, struct btrfs_root *fs_root, > > + struct btrfs_path *path, > > + iterate_irefs_t *iterate, > > + iterate_extrefs_t *iterate2, void *ctx) > > As mentioned above, I''d like to see only be a single iterate function at > this level. > > > +{ > > + int ret, found_refs = 0; > > splitdone.> > + > > + ret = iterate_inode_refs(inum, fs_root, path, iterate, ctx); > > + if (ret && ret != -ENOENT) > > + return ret; > > + > > + if (ret != -ENOENT) > > + ++found_refs; > > I''d make those 2 if statements: > > if (!ret) > ++found_refs; > else if (ret != -ENOENT) > return ret;done.> > + > > +/* > > + * returns 0 if the path could be dumped (probably truncated) > > + * returns <0 in case of an error > > + */ > > +static int inode_to_path2(u64 inum, struct btrfs_inode_extref *iref2, > > + struct extent_buffer *eb, void *ctx) > > We''ll get rid of a second inode_to_path, too, if my suggestion works out.Yep. Thanks Jan! --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 08, 2012 at 03:57:39PM -0700, Mark Fasheh wrote:> Hi Jan, comments inline as usual! > > > This function must not call free_extent_buffer(eb) in line 1306 after > > applying your patch set (immediately before the break). Second, I think > > we''d better add a blocking read lock on eb after incrementing it''s > > refcount, because we need the current content to stay as it is. Both > > isn''t part of your patches, but it might be easier if you make that > > bugfix change as a 3/4 patch within your set and turn this one into 4/4. > > If you don''t like that, I''ll send a separate patch for it. Don''t miss > > the unlock if you do it ;-) > > Ok, I think I was able to figure out and add the correct locking calls. > > Basically I believe I need to wrap access around: > > btrfs_tree_read_lock(eb); > btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); > > <read eb contents> > > btrfs_tree_read_unlock_blocking(eb);You only need a blocking lock if you''re scheduling. Otherwise the spinlock variant is fine.> > > + > > > + while (1) { > > > + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, > > > + &offset); > > > + if (ret < 0) > > > + break; > > > + if (ret) { > > > + ret = found ? 0 : -ENOENT; > > > + break; > > > + } > > > + ++found; > > > + > > > + slot = path->slots[0]; > > > + eb = path->nodes[0]; > > > + /* make sure we can use eb after releasing the path */ > > > + atomic_inc(&eb->refs); > > > > You need a blocking read lock here, too. Grab it before releasing the path.If you''re calling btrfs_search_slot, it will give you a blocking lock on the leaf. If you set path->leave_spinning before the call, you''ll have a spinning lock on the leaf. If you unlock a block that you got from a path (like eb path->nodes[0]), the path structure has a flag for each level that indicates if that block was locked or not. See btrfs_release_path(). So, don''t fiddle the locks without fiddling the paths. You can switch from spinning to/from blocking without touching the path, it figures that out. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 09, 2012 at 19:02 (+0200), Chris Mason wrote:>>>> + >>>> + while (1) { >>>> + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, >>>> + &offset); >>>> + if (ret < 0) >>>> + break; >>>> + if (ret) { >>>> + ret = found ? 0 : -ENOENT; >>>> + break; >>>> + } >>>> + ++found; >>>> + >>>> + slot = path->slots[0]; >>>> + eb = path->nodes[0]; >>>> + /* make sure we can use eb after releasing the path */ >>>> + atomic_inc(&eb->refs); >>> >>> You need a blocking read lock here, too. Grab it before releasing the path. > > If you''re calling btrfs_search_slot, it will give you a blocking lock > on the leaf. If you set path->leave_spinning before the call, you''ll > have a spinning lock on the leaf. > > If you unlock a block that you got from a path (like eb > path->nodes[0]), the path structure has a flag for each level that > indicates if that block was locked or not. See btrfs_release_path(). > So, don''t fiddle the locks without fiddling the paths. > > You can switch from spinning to/from blocking without touching the path, it > figures that out.Note that we''re releasing the path shortly after. My suggestion was to grab an *additional* read lock after atomic_inc(&eb->refs) (probably better extent_buffer_get(eb)) and before btrfs_path_release(). An alternative would be to set path->locks[0] to NULL, which would btrfs_release_path prevent from unlocking it, kidnapping its lock for our purpose. But I much prefer the open coded solution. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 10, 2012 at 10:23:28AM +0200, Jan Schmidt wrote:> On Wed, May 09, 2012 at 19:02 (+0200), Chris Mason wrote: > >>>> + > >>>> + while (1) { > >>>> + ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2, > >>>> + &offset); > >>>> + if (ret < 0) > >>>> + break; > >>>> + if (ret) { > >>>> + ret = found ? 0 : -ENOENT; > >>>> + break; > >>>> + } > >>>> + ++found; > >>>> + > >>>> + slot = path->slots[0]; > >>>> + eb = path->nodes[0]; > >>>> + /* make sure we can use eb after releasing the path */ > >>>> + atomic_inc(&eb->refs); > >>> > >>> You need a blocking read lock here, too. Grab it before releasing the path. > > > > If you''re calling btrfs_search_slot, it will give you a blocking lock > > on the leaf. If you set path->leave_spinning before the call, you''ll > > have a spinning lock on the leaf. > > > > If you unlock a block that you got from a path (like eb > > path->nodes[0]), the path structure has a flag for each level that > > indicates if that block was locked or not. See btrfs_release_path(). > > So, don''t fiddle the locks without fiddling the paths. > > > > You can switch from spinning to/from blocking without touching the path, it > > figures that out. > > Note that we''re releasing the path shortly after. My suggestion was to > grab an *additional* read lock after atomic_inc(&eb->refs) (probably > better extent_buffer_get(eb)) and before btrfs_path_release(). > > An alternative would be to set path->locks[0] to NULL, which would > btrfs_release_path prevent from unlocking it, kidnapping its lock for > our purpose. But I much prefer the open coded solution.Thanks, I see now. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html