Hi gang, I finally sat down to fix that readdir hang that has been in the back of my mind for a while. I *hope* that the fix is pretty simple: just don''t manufacture a fake f_pos, I *think* we can abuse f_version as an indicator that we shouldn''t return entries. Does this look reasonable? We still have the problem that we can generate valid large f_pos values that can confuse 32bit userspace, but that''s a different problem. I think we''ll want filldir generation of EOVERFLOW like what exists for large inodes. The rest of the patches are cleanups that I saw when absorbing the code. It''s all lightly tested with xfstests but it wouldn''t surprise me if I missed something so review is appreciated. Thanks! - z -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The only time we need to advance f_pos is after we''ve successfully given a result to userspace via filldir. This simplification gets rid of the is_curr variable used to update f_pos for the delayed item readdir entries. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/delayed-inode.c | 5 +++-- fs/btrfs/inode.c | 17 +++++++---------- 2 files changed, 10 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 5615eac..4d846a2 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, continue; } - filp->f_pos = curr->key.offset; - di = (struct btrfs_dir_item *)curr->data; name = (char *)(di + 1); name_len = le16_to_cpu(di->name_len); @@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, over = filldir(dirent, name, name_len, curr->key.offset, location.objectid, d_type); + if (!over) + filp->f_pos = curr->key.offset + 1; + if (atomic_dec_and_test(&curr->refs)) kfree(curr); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b11a95e..6a5784b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, char tmp_name[32]; char *name_ptr; int name_len; - int is_curr = 0; /* filp->f_pos points to the current index? */ /* FIXME, use a real flag for deciding about the key type */ if (root->fs_info->tree_root == root) @@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, found_key.offset)) goto next; - filp->f_pos = found_key.offset; - is_curr = 1; - di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item); di_cur = 0; di_total = btrfs_item_size(leaf, item); @@ -5130,6 +5126,9 @@ skip: if (over) goto nopos; + + filp->f_pos = found_key.offset + 1; + di_len = btrfs_dir_name_len(leaf, di) + btrfs_dir_data_len(leaf, di) + sizeof(*di); di_cur += di_len; @@ -5140,23 +5139,21 @@ next: } if (key_type == BTRFS_DIR_INDEX_KEY) { - if (is_curr) - filp->f_pos++; ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir, &ins_list); if (ret) goto nopos; } - /* Reached end of directory/root. Bump pos past the last item. */ - if (key_type == BTRFS_DIR_INDEX_KEY) + /* Reached end of directory/root */ + if (key_type == BTRFS_DIR_INDEX_KEY) { /* * 32-bit glibc will use getdents64, but then strtol - * so the last number we can serve is this. */ filp->f_pos = 0x7fffffff; - else - filp->f_pos++; + } + nopos: ret = 0; err: -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zach Brown
2013-Jun-04 22:17 UTC
[PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX
To work around bugs in userspace btrfs_real_readdir() sets f_pos to an offset that will prevent any future entries from being returned once the last entry is hit. Over time this supposedly impossible offset was decreased from the initial U64_MAX to INT_MAX to appease 32bit userspace. https://oss.oracle.com/pipermail/btrfs-devel/2008-January/000437.html commit c2a8b6e11009398ca9363d8ba8d4e7e40fb897fd commit 89f135d8b53bcccafd91a075366d2704ba257cf3 commit 406266ab9ac8ed8b085c58aacd9e3161480dc5d5 The remaining problem is that resetting f_pos to some impossible offset causes userspace to spin when it''s, well, possible for an entry to have that offset. It takes a single thread on a modern cpu about nine hours of constant file creation and removal to hit an offset past INT_MAX on a single spindle. Instead of trying to find an impossible f_pos that doesn''t break various layers of the stack, let''s use f_version to indicate that readdir should stop returning entries until seek changes f_pos and clears f_version. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/inode.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 6a5784b..e6e2b86 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4983,6 +4983,16 @@ unsigned char btrfs_filetype_table[] = { DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK }; +/* + * There have been buggy applications that can''t handle one readdir pass + * returning the same name for different inodes that are unlinked and + * re-created during the readdir pass. This was partially worked around + * by trying to set f_pos to magic values that broke either 32bit userspace + * or entries with huge offsets. Now we set f_version to a magic value + * which prevents readdir results until seek resets f_pos and f_version. + */ +#define BTRFS_READDIR_EOF ~0ULL + static int btrfs_real_readdir(struct file *filp, void *dirent, filldir_t filldir) { @@ -5008,6 +5018,9 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, char *name_ptr; int name_len; + if (filp->f_version == BTRFS_READDIR_EOF) + return 0; + /* FIXME, use a real flag for deciding about the key type */ if (root->fs_info->tree_root == root) key_type = BTRFS_DIR_ITEM_KEY; @@ -5145,14 +5158,9 @@ next: goto nopos; } - /* Reached end of directory/root */ - if (key_type == BTRFS_DIR_INDEX_KEY) { - /* - * 32-bit glibc will use getdents64, but then strtol - - * so the last number we can serve is this. - */ - filp->f_pos = 0x7fffffff; - } + /* prevent further readdir results without seeking once we hit EOF */ + if (key_type == BTRFS_DIR_INDEX_KEY) + filp->f_version = BTRFS_READDIR_EOF; nopos: ret = 0; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zach Brown
2013-Jun-04 22:17 UTC
[PATCH 3/6] btrfs: trivial delayed item readdir list cleanups
Just call btrfs_put_delayed_items() for each list rather than having two list arguments and duplicated code. list_for_each_entry_safe() can handle an empty list. We don''t have to conditionally use and tear down the lists if we always initialize them to be empty. They''re only populated when needed and the rest of the uses just find empty lists. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/delayed-inode.c | 17 ++--------------- fs/btrfs/delayed-inode.h | 3 +-- fs/btrfs/inode.c | 17 ++++++----------- 3 files changed, 9 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 4d846a2..bc753fc 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1618,18 +1618,11 @@ void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list, atomic_dec(&delayed_node->refs); } -void btrfs_put_delayed_items(struct list_head *ins_list, - struct list_head *del_list) +void btrfs_put_delayed_items(struct list_head *list) { struct btrfs_delayed_item *curr, *next; - list_for_each_entry_safe(curr, next, ins_list, readdir_list) { - list_del(&curr->readdir_list); - if (atomic_dec_and_test(&curr->refs)) - kfree(curr); - } - - list_for_each_entry_safe(curr, next, del_list, readdir_list) { + list_for_each_entry_safe(curr, next, list, readdir_list) { list_del(&curr->readdir_list); if (atomic_dec_and_test(&curr->refs)) kfree(curr); @@ -1642,9 +1635,6 @@ int btrfs_should_delete_dir_index(struct list_head *del_list, struct btrfs_delayed_item *curr, *next; int ret; - if (list_empty(del_list)) - return 0; - list_for_each_entry_safe(curr, next, del_list, readdir_list) { if (curr->key.offset > index) break; @@ -1679,9 +1669,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, int over = 0; unsigned char d_type; - if (list_empty(ins_list)) - return 0; - /* * Changing the data of the delayed item is impossible. So * we needn''t lock them. And we have held i_mutex of the diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h index 1d5c5f7..573506b 100644 --- a/fs/btrfs/delayed-inode.h +++ b/fs/btrfs/delayed-inode.h @@ -135,8 +135,7 @@ void btrfs_destroy_delayed_inodes(struct btrfs_root *root); /* Used for readdir() */ void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list, struct list_head *del_list); -void btrfs_put_delayed_items(struct list_head *ins_list, - struct list_head *del_list); +void btrfs_put_delayed_items(struct list_head *list); int btrfs_should_delete_dir_index(struct list_head *del_list, u64 index); int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e6e2b86..53a8696 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5003,8 +5003,8 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, struct btrfs_key key; struct btrfs_key found_key; struct btrfs_path *path; - struct list_head ins_list; - struct list_head del_list; + LIST_HEAD(ins_list); + LIST_HEAD(del_list); int ret; struct extent_buffer *leaf; int slot; @@ -5048,11 +5048,8 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, path->reada = 1; - if (key_type == BTRFS_DIR_INDEX_KEY) { - INIT_LIST_HEAD(&ins_list); - INIT_LIST_HEAD(&del_list); + if (key_type == BTRFS_DIR_INDEX_KEY) btrfs_get_delayed_items(inode, &ins_list, &del_list); - } btrfs_set_key_type(&key, key_type); key.offset = filp->f_pos; @@ -5083,9 +5080,7 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, break; if (found_key.offset < filp->f_pos) goto next; - if (key_type == BTRFS_DIR_INDEX_KEY && - btrfs_should_delete_dir_index(&del_list, - found_key.offset)) + if (btrfs_should_delete_dir_index(&del_list, found_key.offset)) goto next; di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item); @@ -5165,8 +5160,8 @@ next: nopos: ret = 0; err: - if (key_type == BTRFS_DIR_INDEX_KEY) - btrfs_put_delayed_items(&ins_list, &del_list); + btrfs_put_delayed_items(&ins_list); + btrfs_put_delayed_items(&del_list); btrfs_free_path(path); return ret; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zach Brown
2013-Jun-04 22:17 UTC
[PATCH 4/6] btrfs: simplify finding next/prev delayed items
I''d like to use the currently unused next/prev arguments to __btrfs_lookup_delayed_item() in a future patch. I noticed that the code could be simplified. We don''t need to use rb_next() or rb_prev() to walk back up the tree once we''ve failed to find the key at a leaf. We can record the most recent next and prev items as we descend and return those when the search fails. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/delayed-inode.c | 54 +++++++++++++++++++----------------------------- 1 file changed, 21 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index bc753fc..67e0f9f 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -326,11 +326,11 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len) * __btrfs_lookup_delayed_item - look up the delayed item by key * @delayed_node: pointer to the delayed node * @key: the key to look up - * @prev: used to store the prev item if the right item isn''t found - * @next: used to store the next item if the right item isn''t found + * @prev: set to the item before @key when it isn''t found + * @next: set to the item after @key when it isn''t found * - * Note: if we don''t find the right item, we will return the prev item and - * the next item. + * @prev and @next are only correct if the search terminates at a leaf so + * they''re only set if an item with @key is not found. */ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( struct rb_root *root, @@ -338,48 +338,36 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( struct btrfs_delayed_item **prev, struct btrfs_delayed_item **next) { - struct rb_node *node, *prev_node = NULL; + struct rb_node *node = root->rb_node; + struct btrfs_delayed_item *prev_item = NULL; + struct btrfs_delayed_item *next_item = NULL; struct btrfs_delayed_item *delayed_item = NULL; - int ret = 0; + int ret; - node = root->rb_node; + if (prev) + *prev = NULL; + if (next) + *next = NULL; while (node) { delayed_item = rb_entry(node, struct btrfs_delayed_item, rb_node); - prev_node = node; ret = btrfs_comp_cpu_keys(&delayed_item->key, key); - if (ret < 0) + if (ret < 0) { + prev_item = delayed_item; node = node->rb_right; - else if (ret > 0) + } else if (ret > 0) { + next_item = delayed_item; node = node->rb_left; - else + } else return delayed_item; } - if (prev) { - if (!prev_node) - *prev = NULL; - else if (ret < 0) - *prev = delayed_item; - else if ((node = rb_prev(prev_node)) != NULL) { - *prev = rb_entry(node, struct btrfs_delayed_item, - rb_node); - } else - *prev = NULL; - } + if (prev) + *prev = prev_item; + if (next) + *next = next_item; - if (next) { - if (!prev_node) - *next = NULL; - else if (ret > 0) - *next = delayed_item; - else if ((node = rb_next(prev_node)) != NULL) { - *next = rb_entry(node, struct btrfs_delayed_item, - rb_node); - } else - *next = NULL; - } return NULL; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This just moves some duplicated code into a helper. I couldn''t bring myself to add another copy in an upcoming patch. The delayed_root BUG() in __btrfs_remove_delayed_item() wasn''t needed. The pointer deref will oops later if its null. And now the remaining BUG() is in one place! :) Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/delayed-inode.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 67e0f9f..fcce951 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -382,6 +382,16 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_insertion_item( return item; } +static struct rb_root *get_ins_del_root(struct btrfs_delayed_node *delayed_node, + int ins_del) +{ + if (ins_del == BTRFS_DELAYED_INSERTION_ITEM) + return &delayed_node->ins_root; + if (ins_del == BTRFS_DELAYED_DELETION_ITEM) + return &delayed_node->del_root; + BUG(); +} + static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, struct btrfs_delayed_item *ins, int action) @@ -392,12 +402,7 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, struct btrfs_delayed_item *item; int cmp; - if (action == BTRFS_DELAYED_INSERTION_ITEM) - root = &delayed_node->ins_root; - else if (action == BTRFS_DELAYED_DELETION_ITEM) - root = &delayed_node->del_root; - else - BUG(); + root = get_ins_del_root(delayed_node, action); p = &root->rb_node; node = &ins->rb_node; @@ -460,15 +465,8 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) delayed_root = delayed_item->delayed_node->root->fs_info->delayed_root; - BUG_ON(!delayed_root); - BUG_ON(delayed_item->ins_or_del != BTRFS_DELAYED_DELETION_ITEM && - delayed_item->ins_or_del != BTRFS_DELAYED_INSERTION_ITEM); - - if (delayed_item->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM) - root = &delayed_item->delayed_node->ins_root; - else - root = &delayed_item->delayed_node->del_root; - + root = get_ins_del_root(delayed_item->delayed_node, + delayed_item->ins_or_del); rb_erase(&delayed_item->rb_node, root); delayed_item->delayed_node->count--; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zach Brown
2013-Jun-04 22:18 UTC
[PATCH 6/6] btrfs: get fewer delayed item refs during readdir
On every readdir call all the delayed items for the dir are put on a private list with a held reference. If they''re outside the f_pos values that this readdir call ends up using they''re just dropped and removed from the list. We can make some tiny changes to cut down on this overhead. First, let''s use the delayed item''s key-sorted rbtree to skip items that are before f_pos and will never be used. Second, let''s only acquire the new delayed items after we''ve exausted the existing in-tree items and still have room in the readdir buffer for more entries. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/delayed-inode.c | 21 ++++++++++----------- fs/btrfs/delayed-inode.h | 4 ++-- fs/btrfs/inode.c | 14 +++++++++++--- 3 files changed, 23 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index fcce951..2c3ec89 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1567,28 +1567,27 @@ int btrfs_inode_delayed_dir_index_count(struct inode *inode) return 0; } -void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list, - struct list_head *del_list) +void btrfs_get_delayed_items(struct inode *inode, struct list_head *list, + struct btrfs_key *key, int action) { struct btrfs_delayed_node *delayed_node; struct btrfs_delayed_item *item; + struct btrfs_delayed_item *next; + struct rb_root *root; delayed_node = btrfs_get_delayed_node(inode); if (!delayed_node) return; - mutex_lock(&delayed_node->mutex); - item = __btrfs_first_delayed_insertion_item(delayed_node); - while (item) { - atomic_inc(&item->refs); - list_add_tail(&item->readdir_list, ins_list); - item = __btrfs_next_delayed_item(item); - } + root = get_ins_del_root(delayed_node, action); - item = __btrfs_first_delayed_deletion_item(delayed_node); + mutex_lock(&delayed_node->mutex); + item = __btrfs_lookup_delayed_item(root, key, NULL, &next); + if (item == NULL) + item = next; while (item) { atomic_inc(&item->refs); - list_add_tail(&item->readdir_list, del_list); + list_add_tail(&item->readdir_list, list); item = __btrfs_next_delayed_item(item); } mutex_unlock(&delayed_node->mutex); diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h index 573506b..7c401e1 100644 --- a/fs/btrfs/delayed-inode.h +++ b/fs/btrfs/delayed-inode.h @@ -133,8 +133,8 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root); void btrfs_destroy_delayed_inodes(struct btrfs_root *root); /* Used for readdir() */ -void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list, - struct list_head *del_list); +void btrfs_get_delayed_items(struct inode *inode, struct list_head *list, + struct btrfs_key *key, int action); void btrfs_put_delayed_items(struct list_head *list); int btrfs_should_delete_dir_index(struct list_head *del_list, u64 index); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 53a8696..ad42724 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5048,13 +5048,14 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, path->reada = 1; - if (key_type == BTRFS_DIR_INDEX_KEY) - btrfs_get_delayed_items(inode, &ins_list, &del_list); - btrfs_set_key_type(&key, key_type); key.offset = filp->f_pos; key.objectid = btrfs_ino(inode); + if (key_type == BTRFS_DIR_INDEX_KEY) + btrfs_get_delayed_items(inode, &del_list, &key, + BTRFS_DELAYED_DELETION_ITEM); + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) goto err; @@ -5146,7 +5147,14 @@ next: path->slots[0]++; } + /* don''t acquire delayed item mutex while holding locked path */ + btrfs_free_path(path); + path = NULL; + if (key_type == BTRFS_DIR_INDEX_KEY) { + key.offset = filp->f_pos; + btrfs_get_delayed_items(inode, &ins_list, &key, + BTRFS_DELAYED_INSERTION_ITEM); ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir, &ins_list); if (ret) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting Zach Brown (2013-06-04 18:17:54)> Hi gang, > > I finally sat down to fix that readdir hang that has been in the back > of my mind for a while. I *hope* that the fix is pretty simple: just > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > indicator that we shouldn''t return entries. Does this look reasonable?I like it, and it doesn''t look too far away from how others are abusing f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS loves to surprise me. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:> Quoting Zach Brown (2013-06-04 18:17:54) > > Hi gang, > > > > I finally sat down to fix that readdir hang that has been in the back > > of my mind for a while. I *hope* that the fix is pretty simple: just > > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > indicator that we shouldn''t return entries. Does this look reasonable? > > I like it, and it doesn''t look too far away from how others are abusing > f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > loves to surprise me.Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) - z -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2013-Jun-05 01:19 UTC
Re: [PATCH 1/6] btrfs: set readdir f_pos only after filldir
On tue, 4 Jun 2013 15:17:55 -0700, Zach Brown wrote:> The only time we need to advance f_pos is after we''ve successfully given > a result to userspace via filldir. This simplification gets rid of the > is_curr variable used to update f_pos for the delayed item readdir > entries. > > Signed-off-by: Zach Brown <zab@redhat.com>Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>> --- > fs/btrfs/delayed-inode.c | 5 +++-- > fs/btrfs/inode.c | 17 +++++++---------- > 2 files changed, 10 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c > index 5615eac..4d846a2 100644 > --- a/fs/btrfs/delayed-inode.c > +++ b/fs/btrfs/delayed-inode.c > @@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, > continue; > } > > - filp->f_pos = curr->key.offset; > - > di = (struct btrfs_dir_item *)curr->data; > name = (char *)(di + 1); > name_len = le16_to_cpu(di->name_len); > @@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent, > over = filldir(dirent, name, name_len, curr->key.offset, > location.objectid, d_type); > > + if (!over) > + filp->f_pos = curr->key.offset + 1; > + > if (atomic_dec_and_test(&curr->refs)) > kfree(curr); > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index b11a95e..6a5784b 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, > char tmp_name[32]; > char *name_ptr; > int name_len; > - int is_curr = 0; /* filp->f_pos points to the current index? */ > > /* FIXME, use a real flag for deciding about the key type */ > if (root->fs_info->tree_root == root) > @@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent, > found_key.offset)) > goto next; > > - filp->f_pos = found_key.offset; > - is_curr = 1; > - > di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item); > di_cur = 0; > di_total = btrfs_item_size(leaf, item); > @@ -5130,6 +5126,9 @@ skip: > > if (over) > goto nopos; > + > + filp->f_pos = found_key.offset + 1; > + > di_len = btrfs_dir_name_len(leaf, di) + > btrfs_dir_data_len(leaf, di) + sizeof(*di); > di_cur += di_len; > @@ -5140,23 +5139,21 @@ next: > } > > if (key_type == BTRFS_DIR_INDEX_KEY) { > - if (is_curr) > - filp->f_pos++; > ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir, > &ins_list); > if (ret) > goto nopos; > } > > - /* Reached end of directory/root. Bump pos past the last item. */ > - if (key_type == BTRFS_DIR_INDEX_KEY) > + /* Reached end of directory/root */ > + if (key_type == BTRFS_DIR_INDEX_KEY) { > /* > * 32-bit glibc will use getdents64, but then strtol - > * so the last number we can serve is this. > */ > filp->f_pos = 0x7fffffff; > - else > - filp->f_pos++; > + } > + > nopos: > ret = 0; > err: >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: >> Quoting Zach Brown (2013-06-04 18:17:54) >>> Hi gang, >>> >>> I finally sat down to fix that readdir hang that has been in the back >>> of my mind for a while. I *hope* that the fix is pretty simple: just >>> don''t manufacture a fake f_pos, I *think* we can abuse f_version as an >>> indicator that we shouldn''t return entries. Does this look reasonable? >> >> I like it, and it doesn''t look too far away from how others are abusing >> f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS >> loves to surprise me. > > Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :)If we can not use f_version, we can use private_data. I think this variant is safe. Miao> > - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-Jun-05 13:36 UTC
Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:> On tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote: > > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: > >> Quoting Zach Brown (2013-06-04 18:17:54) > >>> Hi gang, > >>> > >>> I finally sat down to fix that readdir hang that has been in the back > >>> of my mind for a while. I *hope* that the fix is pretty simple: just > >>> don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > >>> indicator that we shouldn''t return entries. Does this look reasonable? > >> > >> I like it, and it doesn''t look too far away from how others are abusing > >> f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > >> loves to surprise me. > > > > Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) > > If we can not use f_version, we can use private_data. I think this variant is > safe.private_data is used within the ioctl user transactions, so a readdir(mountpoint) with a user transaction running can break it. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:> On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote: >> On tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote: >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: >>>> Quoting Zach Brown (2013-06-04 18:17:54) >>>>> Hi gang, >>>>> >>>>> I finally sat down to fix that readdir hang that has been in the back >>>>> of my mind for a while. I *hope* that the fix is pretty simple: just >>>>> don''t manufacture a fake f_pos, I *think* we can abuse f_version as an >>>>> indicator that we shouldn''t return entries. Does this look reasonable? >>>> >>>> I like it, and it doesn''t look too far away from how others are abusing >>>> f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS >>>> loves to surprise me. >>> >>> Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) >> >> If we can not use f_version, we can use private_data. I think this variant is >> safe. > > private_data is used within the ioctl user transactions, so a > readdir(mountpoint) with a user transaction running can break it.don''t worry, we can allocate a structure to keep both transaction handle and the information of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep more information if need in the future. Beside the above method, we also can abuse the low bits of private_data to indicator that we shouldn''t return entries. Thanks Miao> > david > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-Jun-06 13:55 UTC
Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie wrote:> On wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote: > > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote: > >> On tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote: > >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: > >>>> Quoting Zach Brown (2013-06-04 18:17:54) > >>>>> Hi gang, > >>>>> > >>>>> I finally sat down to fix that readdir hang that has been in the back > >>>>> of my mind for a while. I *hope* that the fix is pretty simple: just > >>>>> don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > >>>>> indicator that we shouldn''t return entries. Does this look reasonable? > >>>> > >>>> I like it, and it doesn''t look too far away from how others are abusing > >>>> f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > >>>> loves to surprise me. > >>> > >>> Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) > >> > >> If we can not use f_version, we can use private_data. I think this variant is > >> safe. > > > > private_data is used within the ioctl user transactions, so a > > readdir(mountpoint) with a user transaction running can break it. > > don''t worry, we can allocate a structure to keep both transaction handle and the information > of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep > more information if need in the future. > > Beside the above method, we also can abuse the low bits of private_data to indicator that > we shouldn''t return entries.Allocating a full structure for private_data sounds better than directly modifying the pointer value itself. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting David Sterba (2013-06-06 09:55:50)> On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie wrote: > > On wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote: > > > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote: > > >> On tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote: > > >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: > > >>>> Quoting Zach Brown (2013-06-04 18:17:54) > > >>>>> Hi gang, > > >>>>> > > >>>>> I finally sat down to fix that readdir hang that has been in the back > > >>>>> of my mind for a while. I *hope* that the fix is pretty simple: just > > >>>>> don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > >>>>> indicator that we shouldn''t return entries. Does this look reasonable? > > >>>> > > >>>> I like it, and it doesn''t look too far away from how others are abusing > > >>>> f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > > >>>> loves to surprise me. > > >>> > > >>> Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) > > >> > > >> If we can not use f_version, we can use private_data. I think this variant is > > >> safe. > > > > > > private_data is used within the ioctl user transactions, so a > > > readdir(mountpoint) with a user transaction running can break it. > > > > don''t worry, we can allocate a structure to keep both transaction handle and the information > > of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep > > more information if need in the future. > > > > Beside the above method, we also can abuse the low bits of private_data to indicator that > > we shouldn''t return entries. > > Allocating a full structure for private_data sounds better than directly > modifying the pointer value itself.I''d actually rather tag the pointers than go through kmalloc, we just need one bit (maybe that really just shows how badly we''ve corrupted poor Miao). But, we''re not there yet, I think Zach''s initial patch will work fine. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown wrote:> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: > > Quoting Zach Brown (2013-06-04 18:17:54) > > > Hi gang, > > > > > > I finally sat down to fix that readdir hang that has been in the back > > > of my mind for a while. I *hope* that the fix is pretty simple: just > > > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > > indicator that we shouldn''t return entries. Does this look reasonable? > > > > I like it, and it doesn''t look too far away from how others are abusing > > f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > > loves to surprise me. > > Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :)Or a week later. Pretty close! I couldn''t get NFS to break. Clients see new entries created directly in the exported btrfs and on either of noac and actime=1 client mounts. For whatever that''s worth. But I did find that I''d broken the case of trying to re-enable readdir results by seeking past the last entry (which happens to be the current f_pos now that we''re using f_version). Here''s the incremental fix against what Josef has in -next. I''m cool with either squashing or just committing it. - z Subject: [PATCH] btrfs: reset f_version when seeking to pos Commit 63e3dfe ("btrfs: fix readdir hang with offsets past INT_MAX") switched to using f_version to stop readdir results instead of setting a large f_pos. It inadvertantly changed behaviour in the case where an app specifically seeks to one past the last valid dent->d_off it has seen. Previously f_pos would have changed from the fake f_pos to this new f_pos which would let readdir return new entries. But now that it''s using f_version it might not have seen new entries. generic_file_llseek() won''t clear f_version if the desirned pos happens to be the current f_pos. So we add a little wrapper to notice this case and clear f_version so that entries can be seen in this case. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/inode.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 1059c90..590c274 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4997,6 +4997,23 @@ unsigned char btrfs_filetype_table[] = { * which prevents readdir results until seek resets f_pos and f_version. */ #define BTRFS_READDIR_EOF ~0ULL +static loff_t btrfs_dir_llseek(struct file *file, loff_t offset, int whence) +{ + struct inode *inode = file->f_mapping->host; + loff_t ret; + + /* + * f_version isn''t reset if a seek is attempted to the current pos. A + * caller can be trying to see more entries by seeking past the last + * entry to the current pos after creating a new entry. + */ + mutex_lock(&inode->i_mutex); + ret = generic_file_llseek(file, offset, whence); + if (ret == offset && file->f_version == BTRFS_READDIR_EOF) + file->f_version = 0; + mutex_unlock(&inode->i_mutex); + return ret; +} static int btrfs_real_readdir(struct file *filp, void *dirent, filldir_t filldir) @@ -8642,7 +8659,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = { }; static const struct file_operations btrfs_dir_file_operations = { - .llseek = generic_file_llseek, + .llseek = btrfs_dir_llseek, .read = generic_read_dir, .readdir = btrfs_real_readdir, .unlocked_ioctl = btrfs_ioctl, -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting Zach Brown (2013-06-10 18:39:58)> On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown wrote: > > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote: > > > Quoting Zach Brown (2013-06-04 18:17:54) > > > > Hi gang, > > > > > > > > I finally sat down to fix that readdir hang that has been in the back > > > > of my mind for a while. I *hope* that the fix is pretty simple: just > > > > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > > > indicator that we shouldn''t return entries. Does this look reasonable? > > > > > > I like it, and it doesn''t look too far away from how others are abusing > > > f_version. Have you tried with NFS? I don''t think it''ll hurt, but NFS > > > loves to surprise me. > > > > Mm, no, I hadn''t. I''ll give it a go tomorrow. What could go wrong? :) > > Or a week later. Pretty close! > > I couldn''t get NFS to break. Clients see new entries created directly > in the exported btrfs and on either of noac and actime=1 client mounts. > For whatever that''s worth.Great.> > But I did find that I''d broken the case of trying to re-enable readdir > results by seeking past the last entry (which happens to be the current > f_pos now that we''re using f_version). > > Here''s the incremental fix against what Josef has in -next. I''m cool > with either squashing or just committing it.Lets squash it in, Josef loves to rebase. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:> Hi gang, > > I finally sat down to fix that readdir hang that has been in the back > of my mind for a while. I *hope* that the fix is pretty simple: just > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > indicator that we shouldn''t return entries. Does this look reasonable? > > We still have the problem that we can generate valid large f_pos values > that can confuse 32bit userspace, but that''s a different problem. I > think we''ll want filldir generation of EOVERFLOW like what exists for > large inodes. > > The rest of the patches are cleanups that I saw when absorbing the > code. It''s all lightly tested with xfstests but it wouldn''t surprise > me if I missed something so review is appreciated. > > Thanks! >One of these patches is making new entries not show up in readdir. This was discovered while running stress.sh overnight, it complained about files not matching but when they were checked the files matched. Dropping the entire series made stress.sh run fine. So I''m dropping these for the next merge window but I''ll dig into it and try and figure out what was causing the problem. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting Josef Bacik (2013-07-01 08:54:35)> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote: > > Hi gang, > > > > I finally sat down to fix that readdir hang that has been in the back > > of my mind for a while. I *hope* that the fix is pretty simple: just > > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > indicator that we shouldn''t return entries. Does this look reasonable? > > > > We still have the problem that we can generate valid large f_pos values > > that can confuse 32bit userspace, but that''s a different problem. I > > think we''ll want filldir generation of EOVERFLOW like what exists for > > large inodes. > > > > The rest of the patches are cleanups that I saw when absorbing the > > code. It''s all lightly tested with xfstests but it wouldn''t surprise > > me if I missed something so review is appreciated. > > > > Thanks! > > > > One of these patches is making new entries not show up in readdir. This was > discovered while running stress.sh overnight, it complained about files not > matching but when they were checked the files matched. Dropping the entire > series made stress.sh run fine. So I''m dropping these for the next merge window > but I''ll dig into it and try and figure out what was causing the problem.Unfortunately I''ve only triggered this on flash, and the run takes about two hours to trigger. Trying now with some extra printks to see if I can nail it down -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > code. It''s all lightly tested with xfstests but it wouldn''t surprise > > me if I missed something so review is appreciated.*mmm, hmmm*> One of these patches is making new entries not show up in readdir. This was > discovered while running stress.sh overnight, it complained about files not > matching but when they were checked the files matched. Dropping the entire > series made stress.sh run fine. So I''m dropping these for the next merge window > but I''ll dig into it and try and figure out what was causing the problem.Nerts. It''s got to be the delayed inode stuff. Maybe it''s some unlink/recreate pattern? Is this a thing that stress.sh does? (Where''s stress.sh live?) - z -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting Zach Brown (2013-07-01 12:10:02)> > > code. It''s all lightly tested with xfstests but it wouldn''t surprise > > > me if I missed something so review is appreciated. > > *mmm, hmmm* > > > One of these patches is making new entries not show up in readdir. This was > > discovered while running stress.sh overnight, it complained about files not > > matching but when they were checked the files matched. Dropping the entire > > series made stress.sh run fine. So I''m dropping these for the next merge window > > but I''ll dig into it and try and figure out what was causing the problem. > > Nerts. It''s got to be the delayed inode stuff. > > Maybe it''s some unlink/recreate pattern? Is this a thing that stress.sh > does? (Where''s stress.sh live?)It''s an old namesys tool, I''ve copied it here: http://masoncoding.com/mason/tools/stress.sh My command line: stress.sh -n 50 -s -c /build/linux /mnt Basically its: 1) Make a list of all files in /build/linux and their md5sums 2) Start 50 procs 3) Each proc copies /build/linux into /mnt/stress/proc_num/ 4) Each proc compares the md5sums of its private copy with the original master 5) Each proc deletes the private copy 6) Repeat steps 2-5 forever. The most likely cause of the bug I''m seeing is readdir not finding new files. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 01, 2013 at 08:54:35AM -0400, Josef Bacik wrote:> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote: > > > > I finally sat down to fix that readdir hang that has been in the back > > of my mind for a while. I *hope* that the fix is pretty simple: just > > don''t manufacture a fake f_pos, I *think* we can abuse f_version as an > > indicator that we shouldn''t return entries. Does this look reasonable? > > One of these patches is making new entries not show up in readdir. This was > discovered while running stress.sh overnight, it complained about files not > matching but when they were checked the files matched. Dropping the entire > series made stress.sh run fine. So I''m dropping these for the next merge window > but I''ll dig into it and try and figure out what was causing the problem.OK, how about this. First, just drop the series. Most of it was opportunistic cleanups that I saw as I was reading the code. But it certainly isn''t a comprehensive cleanup. We''d still want to go back later and fix how inefficient the delayed item data structures are for readdir. So from some perspective it''s just risky churn with little upside. And I think I found a much simpler way to stop the current readdir from looping instead of mucking around with f_version. Just use LLONG_MAX if entries have already overflowed INT_MAX. We''d still want real freed offset reuse some day, but that''s a bunch of work that''ll have to be done very carefully. This will at least stop 64bit apps from failing with offsets past 64bits and is very low risk. So just add this patch and forget about the rest of the series? It''ll still technically conflict with the s/filp->f_pos/ctx->pos/ in the readdir interface change in -next but that fixup is trivial. - z From e295deb0f56f846738ee3dec63fe0350f3952503 Mon Sep 17 00:00:00 2001 From: Zach Brown <zab@redhat.com> Date: Wed, 10 Jul 2013 19:48:51 -0400 Subject: [PATCH] btrfs: don''t loop on large offsets in readdir When btrfs readdir() hits the last entry it sets the readdir offset to a huge value to stop buggy apps from breaking when the same name is returned by readdir() with concurrent rename()s. But unconditionally setting the offset to INT_MAX causes readdir() to loop returning any entries with offsets past INT_MAX. It only takes a few hours of constant file creation and removal to create entries past INT_MAX. So let''s set the huge offset to LLONG_MAX if the last entry has already overflowed 32bit loff_t. Without large offsets behaviour is identical. With large offsets 64bit apps will work and 32bit apps will be no more broken than they currently are if they see large offsets. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/inode.c | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c157482..8583e8d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5186,14 +5186,31 @@ next: } /* Reached end of directory/root. Bump pos past the last item. */ - if (key_type == BTRFS_DIR_INDEX_KEY) - /* - * 32-bit glibc will use getdents64, but then strtol - - * so the last number we can serve is this. - */ - filp->f_pos = 0x7fffffff; - else - filp->f_pos++; + filp->f_pos++; + + /* + * Stop new entries from being returned after we return the last + * entry. + * + * New directory entries are assigned a strictly increasing + * offset. This means that new entries created during readdir + * are *guaranteed* to be seen in the future by that readdir. + * This has broken buggy programs which operate on names as + * they''re returned by readdir. Until we re-use freed offsets + * we have this hack to stop new entries from being returned + * under the assumption that they''ll never reach this huge + * offset. + * + * This is being careful not to overflow 32bit loff_t unless the + * last entry requires it because doing so has broken 32bit apps + * in the past. + */ + if (key_type == BTRFS_DIR_INDEX_KEY) { + if (filp->f_pos >= INT_MAX) + filp->f_pos = LLONG_MAX; + else + filp->f_pos = INT_MAX; + } nopos: ret = 0; err: -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html