thr3ads.net - Btrfs devel - [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups [Jun 2013]

If this information is useful, please help other people find it:
Share via:

Zach Brown

2013-Jun-04 22:17 UTC

[PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Hi gang,

I finally sat down to fix that readdir hang that has been in the back
of my mind for a while.  I *hope* that the fix is pretty simple: just
don''t manufacture a fake f_pos, I *think* we can abuse f_version as an
indicator that we shouldn''t return entries.  Does this look reasonable?

We still have the problem that we can generate valid large f_pos values
that can confuse 32bit userspace, but that''s a different problem.  I
think we''ll want filldir generation of EOVERFLOW like what exists for
large inodes. 

The rest of the patches are cleanups that I saw when absorbing the
code.  It''s all lightly tested with xfstests but it wouldn''t
surprise
me if I missed something so review is appreciated.

Thanks!

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:17 UTC

head link

[PATCH 1/6] btrfs: set readdir f_pos only after filldir

The only time we need to advance f_pos is after we''ve successfully
given
a result to userspace via filldir.  This simplification gets rid of the
is_curr variable used to update f_pos for the delayed item readdir
entries.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c |  5 +++--
 fs/btrfs/inode.c         | 17 +++++++----------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5615eac..4d846a2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp,
void *dirent,
 			continue;
 		}
 
-		filp->f_pos = curr->key.offset;
-
 		di = (struct btrfs_dir_item *)curr->data;
 		name = (char *)(di + 1);
 		name_len = le16_to_cpu(di->name_len);
@@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file *filp,
void *dirent,
 		over = filldir(dirent, name, name_len, curr->key.offset,
 			       location.objectid, d_type);
 
+		if (!over)
+			filp->f_pos = curr->key.offset + 1;
+
 		if (atomic_dec_and_test(&curr->refs))
 			kfree(curr);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b11a95e..6a5784b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 	char tmp_name[32];
 	char *name_ptr;
 	int name_len;
-	int is_curr = 0;	/* filp->f_pos points to the current index? */
 
 	/* FIXME, use a real flag for deciding about the key type */
 	if (root->fs_info->tree_root == root)
@@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 						  found_key.offset))
 			goto next;
 
-		filp->f_pos = found_key.offset;
-		is_curr = 1;
-
 		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
 		di_cur = 0;
 		di_total = btrfs_item_size(leaf, item);
@@ -5130,6 +5126,9 @@ skip:
 
 			if (over)
 				goto nopos;
+
+			filp->f_pos = found_key.offset + 1;
+
 			di_len = btrfs_dir_name_len(leaf, di) +
 				 btrfs_dir_data_len(leaf, di) + sizeof(*di);
 			di_cur += di_len;
@@ -5140,23 +5139,21 @@ next:
 	}
 
 	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		if (is_curr)
-			filp->f_pos++;
 		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
 						      &ins_list);
 		if (ret)
 			goto nopos;
 	}
 
-	/* Reached end of directory/root. Bump pos past the last item. */
-	if (key_type == BTRFS_DIR_INDEX_KEY)
+	/* Reached end of directory/root */
+	if (key_type == BTRFS_DIR_INDEX_KEY) {
 		/*
 		 * 32-bit glibc will use getdents64, but then strtol -
 		 * so the last number we can serve is this.
 		 */
 		filp->f_pos = 0x7fffffff;
-	else
-		filp->f_pos++;
+	}
+
 nopos:
 	ret = 0;
 err:
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:17 UTC

head link

[PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX

To work around bugs in userspace btrfs_real_readdir() sets f_pos to an
offset that will prevent any future entries from being returned once the
last entry is hit.  Over time this supposedly impossible offset was
decreased from the initial U64_MAX to INT_MAX to appease 32bit
userspace.

  https://oss.oracle.com/pipermail/btrfs-devel/2008-January/000437.html
  commit c2a8b6e11009398ca9363d8ba8d4e7e40fb897fd
  commit 89f135d8b53bcccafd91a075366d2704ba257cf3
  commit 406266ab9ac8ed8b085c58aacd9e3161480dc5d5

The remaining problem is that resetting f_pos to some impossible offset
causes userspace to spin when it''s, well, possible for an entry to have
that offset.  It takes a single thread on a modern cpu about nine hours
of constant file creation and removal to hit an offset past INT_MAX on a
single spindle.

Instead of trying to find an impossible f_pos that doesn''t break
various
layers of the stack, let''s use f_version to indicate that readdir
should
stop returning entries until seek changes f_pos and clears f_version.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6a5784b..e6e2b86 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4983,6 +4983,16 @@ unsigned char btrfs_filetype_table[] = {
 	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
 };
 
+/*
+ * There have been buggy applications that can''t handle one readdir
pass
+ * returning the same name for different inodes that are unlinked and
+ * re-created during the readdir pass.  This was partially worked around
+ * by trying to set f_pos to magic values that broke either 32bit userspace
+ * or entries with huge offsets.  Now we set f_version to a magic value
+ * which prevents readdir results until seek resets f_pos and f_version.
+ */
+#define BTRFS_READDIR_EOF ~0ULL
+
 static int btrfs_real_readdir(struct file *filp, void *dirent,
 			      filldir_t filldir)
 {
@@ -5008,6 +5018,9 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 	char *name_ptr;
 	int name_len;
 
+	if (filp->f_version == BTRFS_READDIR_EOF)
+		return 0;
+
 	/* FIXME, use a real flag for deciding about the key type */
 	if (root->fs_info->tree_root == root)
 		key_type = BTRFS_DIR_ITEM_KEY;
@@ -5145,14 +5158,9 @@ next:
 			goto nopos;
 	}
 
-	/* Reached end of directory/root */
-	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		/*
-		 * 32-bit glibc will use getdents64, but then strtol -
-		 * so the last number we can serve is this.
-		 */
-		filp->f_pos = 0x7fffffff;
-	}
+	/* prevent further readdir results without seeking once we hit EOF */
+	if (key_type == BTRFS_DIR_INDEX_KEY)
+		filp->f_version = BTRFS_READDIR_EOF;
 
 nopos:
 	ret = 0;
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:17 UTC

head link

[PATCH 3/6] btrfs: trivial delayed item readdir list cleanups

Just call btrfs_put_delayed_items() for each list rather than having two
list arguments and duplicated code.

list_for_each_entry_safe() can handle an empty list.

We don''t have to conditionally use and tear down the lists if we always
initialize them to be empty.  They''re only populated when needed and
the
rest of the uses just find empty lists.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 17 ++---------------
 fs/btrfs/delayed-inode.h |  3 +--
 fs/btrfs/inode.c         | 17 ++++++-----------
 3 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 4d846a2..bc753fc 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1618,18 +1618,11 @@ void btrfs_get_delayed_items(struct inode *inode, struct
list_head *ins_list,
 	atomic_dec(&delayed_node->refs);
 }
 
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list)
+void btrfs_put_delayed_items(struct list_head *list)
 {
 	struct btrfs_delayed_item *curr, *next;
 
-	list_for_each_entry_safe(curr, next, ins_list, readdir_list) {
-		list_del(&curr->readdir_list);
-		if (atomic_dec_and_test(&curr->refs))
-			kfree(curr);
-	}
-
-	list_for_each_entry_safe(curr, next, del_list, readdir_list) {
+	list_for_each_entry_safe(curr, next, list, readdir_list) {
 		list_del(&curr->readdir_list);
 		if (atomic_dec_and_test(&curr->refs))
 			kfree(curr);
@@ -1642,9 +1635,6 @@ int btrfs_should_delete_dir_index(struct list_head
*del_list,
 	struct btrfs_delayed_item *curr, *next;
 	int ret;
 
-	if (list_empty(del_list))
-		return 0;
-
 	list_for_each_entry_safe(curr, next, del_list, readdir_list) {
 		if (curr->key.offset > index)
 			break;
@@ -1679,9 +1669,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp,
void *dirent,
 	int over = 0;
 	unsigned char d_type;
 
-	if (list_empty(ins_list))
-		return 0;
-
 	/*
 	 * Changing the data of the delayed item is impossible. So
 	 * we needn''t lock them. And we have held i_mutex of the
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 1d5c5f7..573506b 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -135,8 +135,7 @@ void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 /* Used for readdir() */
 void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 			     struct list_head *del_list);
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list);
+void btrfs_put_delayed_items(struct list_head *list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
 				  u64 index);
 int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6e2b86..53a8696 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5003,8 +5003,8 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_path *path;
-	struct list_head ins_list;
-	struct list_head del_list;
+	LIST_HEAD(ins_list);
+	LIST_HEAD(del_list);
 	int ret;
 	struct extent_buffer *leaf;
 	int slot;
@@ -5048,11 +5048,8 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 
 	path->reada = 1;
 
-	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		INIT_LIST_HEAD(&ins_list);
-		INIT_LIST_HEAD(&del_list);
+	if (key_type == BTRFS_DIR_INDEX_KEY)
 		btrfs_get_delayed_items(inode, &ins_list, &del_list);
-	}
 
 	btrfs_set_key_type(&key, key_type);
 	key.offset = filp->f_pos;
@@ -5083,9 +5080,7 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 			break;
 		if (found_key.offset < filp->f_pos)
 			goto next;
-		if (key_type == BTRFS_DIR_INDEX_KEY &&
-		    btrfs_should_delete_dir_index(&del_list,
-						  found_key.offset))
+		if (btrfs_should_delete_dir_index(&del_list, found_key.offset))
 			goto next;
 
 		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
@@ -5165,8 +5160,8 @@ next:
 nopos:
 	ret = 0;
 err:
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		btrfs_put_delayed_items(&ins_list, &del_list);
+	btrfs_put_delayed_items(&ins_list);
+	btrfs_put_delayed_items(&del_list);
 	btrfs_free_path(path);
 	return ret;
 }
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:17 UTC

head link

[PATCH 4/6] btrfs: simplify finding next/prev delayed items

I''d like to use the currently unused next/prev arguments to
__btrfs_lookup_delayed_item() in a future patch.  I noticed that the
code could be simplified.

We don''t need to use rb_next() or rb_prev() to walk back up the tree
once we''ve failed to find the key at a leaf.  We can record the most
recent next and prev items as we descend and return those when the
search fails.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 54 +++++++++++++++++++-----------------------------
 1 file changed, 21 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index bc753fc..67e0f9f 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -326,11 +326,11 @@ static struct btrfs_delayed_item
*btrfs_alloc_delayed_item(u32 data_len)
  * __btrfs_lookup_delayed_item - look up the delayed item by key
  * @delayed_node: pointer to the delayed node
  * @key:	  the key to look up
- * @prev:	  used to store the prev item if the right item isn''t found
- * @next:	  used to store the next item if the right item isn''t found
+ * @prev:	  set to the item before @key when it isn''t found
+ * @next:	  set to the item after @key when it isn''t found
  *
- * Note: if we don''t find the right item, we will return the prev item
and
- * the next item.
+ * @prev and @next are only correct if the search terminates at a leaf so
+ * they''re only set if an item with @key is not found.
  */
 static struct btrfs_delayed_item *__btrfs_lookup_delayed_item(
 				struct rb_root *root,
@@ -338,48 +338,36 @@ static struct btrfs_delayed_item
*__btrfs_lookup_delayed_item(
 				struct btrfs_delayed_item **prev,
 				struct btrfs_delayed_item **next)
 {
-	struct rb_node *node, *prev_node = NULL;
+	struct rb_node *node = root->rb_node;
+	struct btrfs_delayed_item *prev_item = NULL;
+	struct btrfs_delayed_item *next_item = NULL;
 	struct btrfs_delayed_item *delayed_item = NULL;
-	int ret = 0;
+	int ret;
 
-	node = root->rb_node;
+	if (prev)
+		*prev = NULL;
+	if (next)
+		*next = NULL;
 
 	while (node) {
 		delayed_item = rb_entry(node, struct btrfs_delayed_item,
 					rb_node);
-		prev_node = node;
 		ret = btrfs_comp_cpu_keys(&delayed_item->key, key);
-		if (ret < 0)
+		if (ret < 0) {
+			prev_item = delayed_item;
 			node = node->rb_right;
-		else if (ret > 0)
+		} else if (ret > 0) {
+			next_item = delayed_item;
 			node = node->rb_left;
-		else
+		} else
 			return delayed_item;
 	}
 
-	if (prev) {
-		if (!prev_node)
-			*prev = NULL;
-		else if (ret < 0)
-			*prev = delayed_item;
-		else if ((node = rb_prev(prev_node)) != NULL) {
-			*prev = rb_entry(node, struct btrfs_delayed_item,
-					 rb_node);
-		} else
-			*prev = NULL;
-	}
+	if (prev)
+		*prev = prev_item;
+	if (next)
+		*next = next_item;
 
-	if (next) {
-		if (!prev_node)
-			*next = NULL;
-		else if (ret > 0)
-			*next = delayed_item;
-		else if ((node = rb_next(prev_node)) != NULL) {
-			*next = rb_entry(node, struct btrfs_delayed_item,
-					 rb_node);
-		} else
-			*next = NULL;
-	}
 	return NULL;
 }
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:17 UTC

head link

[PATCH 5/6] btrfs: add helper to get delayed item root

This just moves some duplicated code into a helper.  I couldn''t bring
myself to add another copy in an upcoming patch.

The delayed_root BUG() in __btrfs_remove_delayed_item() wasn''t needed.
The pointer deref will oops later if its null.

And now the remaining BUG() is in one place! :)

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 67e0f9f..fcce951 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -382,6 +382,16 @@ static struct btrfs_delayed_item
*__btrfs_lookup_delayed_insertion_item(
 	return item;
 }
 
+static struct rb_root *get_ins_del_root(struct btrfs_delayed_node
*delayed_node,
+					int ins_del)
+{
+	if (ins_del == BTRFS_DELAYED_INSERTION_ITEM)
+		return &delayed_node->ins_root;
+	if (ins_del == BTRFS_DELAYED_DELETION_ITEM)
+		return &delayed_node->del_root;
+	BUG();
+}
+
 static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node,
 				    struct btrfs_delayed_item *ins,
 				    int action)
@@ -392,12 +402,7 @@ static int __btrfs_add_delayed_item(struct
btrfs_delayed_node *delayed_node,
 	struct btrfs_delayed_item *item;
 	int cmp;
 
-	if (action == BTRFS_DELAYED_INSERTION_ITEM)
-		root = &delayed_node->ins_root;
-	else if (action == BTRFS_DELAYED_DELETION_ITEM)
-		root = &delayed_node->del_root;
-	else
-		BUG();
+	root = get_ins_del_root(delayed_node, action);
 	p = &root->rb_node;
 	node = &ins->rb_node;
 
@@ -460,15 +465,8 @@ static void __btrfs_remove_delayed_item(struct
btrfs_delayed_item *delayed_item)
 
 	delayed_root =
delayed_item->delayed_node->root->fs_info->delayed_root;
 
-	BUG_ON(!delayed_root);
-	BUG_ON(delayed_item->ins_or_del != BTRFS_DELAYED_DELETION_ITEM &&
-	       delayed_item->ins_or_del != BTRFS_DELAYED_INSERTION_ITEM);
-
-	if (delayed_item->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM)
-		root = &delayed_item->delayed_node->ins_root;
-	else
-		root = &delayed_item->delayed_node->del_root;
-
+	root = get_ins_del_root(delayed_item->delayed_node,
+				delayed_item->ins_or_del);
 	rb_erase(&delayed_item->rb_node, root);
 	delayed_item->delayed_node->count--;
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 22:18 UTC

head link

[PATCH 6/6] btrfs: get fewer delayed item refs during readdir

On every readdir call all the delayed items for the dir are put on a
private list with a held reference.  If they''re outside the f_pos
values
that this readdir call ends up using they''re just dropped and removed
from the list.  We can make some tiny changes to cut down on this
overhead.

First, let''s use the delayed item''s key-sorted rbtree to skip
items that
are before f_pos and will never be used.

Second, let''s only acquire the new delayed items after we''ve
exausted
the existing in-tree items and still have room in the readdir buffer for
more entries.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 21 ++++++++++-----------
 fs/btrfs/delayed-inode.h |  4 ++--
 fs/btrfs/inode.c         | 14 +++++++++++---
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index fcce951..2c3ec89 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1567,28 +1567,27 @@ int btrfs_inode_delayed_dir_index_count(struct inode
*inode)
 	return 0;
 }
 
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list)
+void btrfs_get_delayed_items(struct inode *inode, struct list_head *list,
+			     struct btrfs_key *key, int action)
 {
 	struct btrfs_delayed_node *delayed_node;
 	struct btrfs_delayed_item *item;
+	struct btrfs_delayed_item *next;
+	struct rb_root *root;
 
 	delayed_node = btrfs_get_delayed_node(inode);
 	if (!delayed_node)
 		return;
 
-	mutex_lock(&delayed_node->mutex);
-	item = __btrfs_first_delayed_insertion_item(delayed_node);
-	while (item) {
-		atomic_inc(&item->refs);
-		list_add_tail(&item->readdir_list, ins_list);
-		item = __btrfs_next_delayed_item(item);
-	}
+	root = get_ins_del_root(delayed_node, action);
 
-	item = __btrfs_first_delayed_deletion_item(delayed_node);
+	mutex_lock(&delayed_node->mutex);
+	item = __btrfs_lookup_delayed_item(root, key, NULL, &next);
+	if (item == NULL)
+		item = next;
 	while (item) {
 		atomic_inc(&item->refs);
-		list_add_tail(&item->readdir_list, del_list);
+		list_add_tail(&item->readdir_list, list);
 		item = __btrfs_next_delayed_item(item);
 	}
 	mutex_unlock(&delayed_node->mutex);
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 573506b..7c401e1 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -133,8 +133,8 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 
 /* Used for readdir() */
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list);
+void btrfs_get_delayed_items(struct inode *inode, struct list_head *list,
+			     struct btrfs_key *key, int action);
 void btrfs_put_delayed_items(struct list_head *list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
 				  u64 index);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 53a8696..ad42724 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5048,13 +5048,14 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
 
 	path->reada = 1;
 
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		btrfs_get_delayed_items(inode, &ins_list, &del_list);
-
 	btrfs_set_key_type(&key, key_type);
 	key.offset = filp->f_pos;
 	key.objectid = btrfs_ino(inode);
 
+	if (key_type == BTRFS_DIR_INDEX_KEY)
+		btrfs_get_delayed_items(inode, &del_list, &key,
+					BTRFS_DELAYED_DELETION_ITEM);
+
 	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 	if (ret < 0)
 		goto err;
@@ -5146,7 +5147,14 @@ next:
 		path->slots[0]++;
 	}
 
+	/* don''t acquire delayed item mutex while holding locked path */
+	btrfs_free_path(path);
+	path = NULL;
+
 	if (key_type == BTRFS_DIR_INDEX_KEY) {
+		key.offset = filp->f_pos;
+		btrfs_get_delayed_items(inode, &ins_list, &key,
+					BTRFS_DELAYED_INSERTION_ITEM);
 		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
 						      &ins_list);
 		if (ret)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2013-Jun-04 23:16 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Quoting Zach Brown (2013-06-04 18:17:54)> Hi gang,
> 
> I finally sat down to fix that readdir hang that has been in the back
> of my mind for a while.  I *hope* that the fix is pretty simple: just
> don''t manufacture a fake f_pos, I *think* we can abuse f_version
as an
> indicator that we shouldn''t return entries.  Does this look
reasonable?
I like it, and it doesn''t look too far away from how others are abusing
f_version.  Have you tried with NFS?  I don''t think it''ll
hurt, but NFS
loves to surprise me.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-04 23:26 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason
wrote:> Quoting Zach Brown (2013-06-04 18:17:54)
> > Hi gang,
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don''t manufacture a fake f_pos, I *think* we can abuse
f_version as an
> > indicator that we shouldn''t return entries.  Does this look
reasonable?
> 
> I like it, and it doesn''t look too far away from how others are
abusing
> f_version.  Have you tried with NFS?  I don''t think it''ll
hurt, but NFS
> loves to surprise me.
Mm, no, I hadn''t.  I''ll give it a go tomorrow.  What could go
wrong? :)

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miao Xie

2013-Jun-05 01:19 UTC

head link

Re: [PATCH 1/6] btrfs: set readdir f_pos only after filldir

On 	tue, 4 Jun 2013 15:17:55 -0700, Zach Brown wrote:> The only time we need to advance f_pos is after we''ve successfully
given
> a result to userspace via filldir.  This simplification gets rid of the
> is_curr variable used to update f_pos for the delayed item readdir
> entries.
> 
> Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
>  fs/btrfs/delayed-inode.c |  5 +++--
>  fs/btrfs/inode.c         | 17 +++++++----------
>  2 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 5615eac..4d846a2 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file
*filp, void *dirent,
>  			continue;
>  		}
>  
> -		filp->f_pos = curr->key.offset;
> -
>  		di = (struct btrfs_dir_item *)curr->data;
>  		name = (char *)(di + 1);
>  		name_len = le16_to_cpu(di->name_len);
> @@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file
*filp, void *dirent,
>  		over = filldir(dirent, name, name_len, curr->key.offset,
>  			       location.objectid, d_type);
>  
> +		if (!over)
> +			filp->f_pos = curr->key.offset + 1;
> +
>  		if (atomic_dec_and_test(&curr->refs))
>  			kfree(curr);
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b11a95e..6a5784b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
>  	char tmp_name[32];
>  	char *name_ptr;
>  	int name_len;
> -	int is_curr = 0;	/* filp->f_pos points to the current index? */
>  
>  	/* FIXME, use a real flag for deciding about the key type */
>  	if (root->fs_info->tree_root == root)
> @@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void
*dirent,
>  						  found_key.offset))
>  			goto next;
>  
> -		filp->f_pos = found_key.offset;
> -		is_curr = 1;
> -
>  		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
>  		di_cur = 0;
>  		di_total = btrfs_item_size(leaf, item);
> @@ -5130,6 +5126,9 @@ skip:
>  
>  			if (over)
>  				goto nopos;
> +
> +			filp->f_pos = found_key.offset + 1;
> +
>  			di_len = btrfs_dir_name_len(leaf, di) +
>  				 btrfs_dir_data_len(leaf, di) + sizeof(*di);
>  			di_cur += di_len;
> @@ -5140,23 +5139,21 @@ next:
>  	}
>  
>  	if (key_type == BTRFS_DIR_INDEX_KEY) {
> -		if (is_curr)
> -			filp->f_pos++;
>  		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
>  						      &ins_list);
>  		if (ret)
>  			goto nopos;
>  	}
>  
> -	/* Reached end of directory/root. Bump pos past the last item. */
> -	if (key_type == BTRFS_DIR_INDEX_KEY)
> +	/* Reached end of directory/root */
> +	if (key_type == BTRFS_DIR_INDEX_KEY) {
>  		/*
>  		 * 32-bit glibc will use getdents64, but then strtol -
>  		 * so the last number we can serve is this.
>  		 */
>  		filp->f_pos = 0x7fffffff;
> -	else
> -		filp->f_pos++;
> +	}
> +
>  nopos:
>  	ret = 0;
>  err:
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miao Xie

2013-Jun-05 02:34 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
>> Quoting Zach Brown (2013-06-04 18:17:54)
>>> Hi gang,
>>>
>>> I finally sat down to fix that readdir hang that has been in the
back
>>> of my mind for a while.  I *hope* that the fix is pretty simple:
just
>>> don''t manufacture a fake f_pos, I *think* we can abuse
f_version as an
>>> indicator that we shouldn''t return entries.  Does this
look reasonable?
>>
>> I like it, and it doesn''t look too far away from how others
are abusing
>> f_version.  Have you tried with NFS?  I don''t think
it''ll hurt, but NFS
>> loves to surprise me.
> 
> Mm, no, I hadn''t.  I''ll give it a go tomorrow.  What
could go wrong? :)
If we can not use f_version, we can use private_data. I think this variant is
safe.

Miao
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba

2013-Jun-05 13:36 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie
wrote:> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> >> Quoting Zach Brown (2013-06-04 18:17:54)
> >>> Hi gang,
> >>>
> >>> I finally sat down to fix that readdir hang that has been in
the back
> >>> of my mind for a while.  I *hope* that the fix is pretty
simple: just
> >>> don''t manufacture a fake f_pos, I *think* we can
abuse f_version as an
> >>> indicator that we shouldn''t return entries.  Does
this look reasonable?
> >>
> >> I like it, and it doesn''t look too far away from how
others are abusing
> >> f_version.  Have you tried with NFS?  I don''t think
it''ll hurt, but NFS
> >> loves to surprise me.
> > 
> > Mm, no, I hadn''t.  I''ll give it a go tomorrow.  What
could go wrong? :)
> 
> If we can not use f_version, we can use private_data. I think this variant
is
> safe.
private_data is used within the ioctl user transactions, so a
readdir(mountpoint) with a user transaction running can break it.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miao Xie

2013-Jun-06 01:35 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On 	wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:> On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
>> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
>>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
>>>> Quoting Zach Brown (2013-06-04 18:17:54)
>>>>> Hi gang,
>>>>>
>>>>> I finally sat down to fix that readdir hang that has been
in the back
>>>>> of my mind for a while.  I *hope* that the fix is pretty
simple: just
>>>>> don''t manufacture a fake f_pos, I *think* we can
abuse f_version as an
>>>>> indicator that we shouldn''t return entries.  Does
this look reasonable?
>>>>
>>>> I like it, and it doesn''t look too far away from how
others are abusing
>>>> f_version.  Have you tried with NFS?  I don''t think
it''ll hurt, but NFS
>>>> loves to surprise me.
>>>
>>> Mm, no, I hadn''t.  I''ll give it a go tomorrow. 
What could go wrong? :)
>>
>> If we can not use f_version, we can use private_data. I think this
variant is
>> safe.
> 
> private_data is used within the ioctl user transactions, so a
> readdir(mountpoint) with a user transaction running can break it.
don''t worry, we can allocate a structure to keep both transaction
handle and the information
of readdir, just like ext3/ext4. It is a flexible way and we can extend the
structure to keep
more information if need in the future.

Beside the above method, we also can abuse the low bits of private_data to
indicator that
we shouldn''t return entries.

Thanks
Miao
> 
> david
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba

2013-Jun-06 13:55 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie
wrote:> On 	wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:
> > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
> >> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> >>>> Quoting Zach Brown (2013-06-04 18:17:54)
> >>>>> Hi gang,
> >>>>>
> >>>>> I finally sat down to fix that readdir hang that has
been in the back
> >>>>> of my mind for a while.  I *hope* that the fix is
pretty simple: just
> >>>>> don''t manufacture a fake f_pos, I *think* we
can abuse f_version as an
> >>>>> indicator that we shouldn''t return entries. 
Does this look reasonable?
> >>>>
> >>>> I like it, and it doesn''t look too far away from
how others are abusing
> >>>> f_version.  Have you tried with NFS?  I don''t
think it''ll hurt, but NFS
> >>>> loves to surprise me.
> >>>
> >>> Mm, no, I hadn''t.  I''ll give it a go
tomorrow.  What could go wrong? :)
> >>
> >> If we can not use f_version, we can use private_data. I think this
variant is
> >> safe.
> > 
> > private_data is used within the ioctl user transactions, so a
> > readdir(mountpoint) with a user transaction running can break it.
> 
> don''t worry, we can allocate a structure to keep both transaction
handle and the information
> of readdir, just like ext3/ext4. It is a flexible way and we can extend the
structure to keep
> more information if need in the future.
> 
> Beside the above method, we also can abuse the low bits of private_data to
indicator that
> we shouldn''t return entries.
Allocating a full structure for private_data sounds better than directly
modifying the pointer value itself.


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2013-Jun-06 14:32 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Quoting David Sterba (2013-06-06 09:55:50)> On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie wrote:
> > On    wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:
> > > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
> > >> On         tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> > >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason
wrote:
> > >>>> Quoting Zach Brown (2013-06-04 18:17:54)
> > >>>>> Hi gang,
> > >>>>>
> > >>>>> I finally sat down to fix that readdir hang that
has been in the back
> > >>>>> of my mind for a while.  I *hope* that the fix is
pretty simple: just
> > >>>>> don''t manufacture a fake f_pos, I
*think* we can abuse f_version as an
> > >>>>> indicator that we shouldn''t return
entries.  Does this look reasonable?
> > >>>>
> > >>>> I like it, and it doesn''t look too far away
from how others are abusing
> > >>>> f_version.  Have you tried with NFS?  I
don''t think it''ll hurt, but NFS
> > >>>> loves to surprise me.
> > >>>
> > >>> Mm, no, I hadn''t.  I''ll give it a go
tomorrow.  What could go wrong? :)
> > >>
> > >> If we can not use f_version, we can use private_data. I think
this variant is
> > >> safe.
> > > 
> > > private_data is used within the ioctl user transactions, so a
> > > readdir(mountpoint) with a user transaction running can break it.
> > 
> > don''t worry, we can allocate a structure to keep both
transaction handle and the information
> > of readdir, just like ext3/ext4. It is a flexible way and we can
extend the structure to keep
> > more information if need in the future.
> > 
> > Beside the above method, we also can abuse the low bits of
private_data to indicator that
> > we shouldn''t return entries.
> 
> Allocating a full structure for private_data sounds better than directly
> modifying the pointer value itself.
I''d actually rather tag the pointers than go through kmalloc, we just
need one bit (maybe that really just shows how badly we''ve corrupted
poor Miao).  But, we''re not there yet, I think Zach''s initial
patch will
work fine.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jun-10 22:39 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown
wrote:> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> > Quoting Zach Brown (2013-06-04 18:17:54)
> > > Hi gang,
> > > 
> > > I finally sat down to fix that readdir hang that has been in the
back
> > > of my mind for a while.  I *hope* that the fix is pretty simple:
just
> > > don''t manufacture a fake f_pos, I *think* we can abuse
f_version as an
> > > indicator that we shouldn''t return entries.  Does this
look reasonable?
> > 
> > I like it, and it doesn''t look too far away from how others
are abusing
> > f_version.  Have you tried with NFS?  I don''t think
it''ll hurt, but NFS
> > loves to surprise me.
> 
> Mm, no, I hadn''t.  I''ll give it a go tomorrow.  What
could go wrong? :)
Or a week later.  Pretty close!

I couldn''t get NFS to break.  Clients see new entries created directly
in the exported btrfs and on either of noac and actime=1 client mounts.
For whatever that''s worth.

But I did find that I''d broken the case of trying to re-enable readdir
results by seeking past the last entry (which happens to be the current
f_pos now that we''re using f_version).

Here''s the incremental fix against what Josef has in -next. 
I''m cool
with either squashing or just committing it.

- z

Subject: [PATCH] btrfs: reset f_version when seeking to pos

Commit 63e3dfe ("btrfs: fix readdir hang with offsets past INT_MAX")
switched to using f_version to stop readdir results instead of setting a
large f_pos.

It inadvertantly changed behaviour in the case where an app specifically
seeks to one past the last valid dent->d_off it has seen.  Previously
f_pos would have changed from the fake f_pos to this new f_pos which
would let readdir return new entries.

But now that it''s using f_version it might not have seen new entries.
generic_file_llseek() won''t clear f_version if the desirned pos happens
to be the current f_pos.

So we add a little wrapper to notice this case and clear f_version so
that entries can be seen in this case.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1059c90..590c274 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4997,6 +4997,23 @@ unsigned char btrfs_filetype_table[] = {
  * which prevents readdir results until seek resets f_pos and f_version.
  */
 #define BTRFS_READDIR_EOF ~0ULL
+static loff_t btrfs_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_mapping->host;
+	loff_t ret;
+
+	/*
+	 * f_version isn''t reset if a seek is attempted to the current pos. 
A
+	 * caller can be trying to see more entries by seeking past the last
+	 * entry to the current pos after creating a new entry.
+	 */
+	mutex_lock(&inode->i_mutex);
+	ret = generic_file_llseek(file, offset, whence);
+	if (ret == offset && file->f_version == BTRFS_READDIR_EOF)
+		file->f_version = 0;
+	mutex_unlock(&inode->i_mutex);
+	return ret;
+}

 static int btrfs_real_readdir(struct file *filp, void *dirent,
 			      filldir_t filldir)
@@ -8642,7 +8659,7 @@ static const struct inode_operations
btrfs_dir_ro_inode_operations = {
 };

 static const struct file_operations btrfs_dir_file_operations = {
-	.llseek		= generic_file_llseek,
+	.llseek		= btrfs_dir_llseek,
 	.read		= generic_read_dir,
 	.readdir	= btrfs_real_readdir,
 	.unlocked_ioctl	= btrfs_ioctl,
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2013-Jun-12 12:59 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Quoting Zach Brown (2013-06-10 18:39:58)> On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown wrote:
> > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> > > Quoting Zach Brown (2013-06-04 18:17:54)
> > > > Hi gang,
> > > > 
> > > > I finally sat down to fix that readdir hang that has been in
the back
> > > > of my mind for a while.  I *hope* that the fix is pretty
simple: just
> > > > don''t manufacture a fake f_pos, I *think* we can
abuse f_version as an
> > > > indicator that we shouldn''t return entries.  Does
this look reasonable?
> > > 
> > > I like it, and it doesn''t look too far away from how
others are abusing
> > > f_version.  Have you tried with NFS?  I don''t think
it''ll hurt, but NFS
> > > loves to surprise me.
> > 
> > Mm, no, I hadn''t.  I''ll give it a go tomorrow.  What
could go wrong? :)
> 
> Or a week later.  Pretty close!
> 
> I couldn''t get NFS to break.  Clients see new entries created
directly
> in the exported btrfs and on either of noac and actime=1 client mounts.
> For whatever that''s worth.
Great.
> 
> But I did find that I''d broken the case of trying to re-enable
readdir
> results by seeking past the last entry (which happens to be the current
> f_pos now that we''re using f_version).
> 
> Here''s the incremental fix against what Josef has in -next. 
I''m cool
> with either squashing or just committing it.
Lets squash it in, Josef loves to rebase.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Jul-01 12:54 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown
wrote:> Hi gang,
> 
> I finally sat down to fix that readdir hang that has been in the back
> of my mind for a while.  I *hope* that the fix is pretty simple: just
> don''t manufacture a fake f_pos, I *think* we can abuse f_version
as an
> indicator that we shouldn''t return entries.  Does this look
reasonable?
> 
> We still have the problem that we can generate valid large f_pos values
> that can confuse 32bit userspace, but that''s a different problem. 
I
> think we''ll want filldir generation of EOVERFLOW like what exists
for
> large inodes. 
> 
> The rest of the patches are cleanups that I saw when absorbing the
> code.  It''s all lightly tested with xfstests but it
wouldn''t surprise
> me if I missed something so review is appreciated.
> 
> Thanks!
> 
One of these patches is making new entries not show up in readdir.  This was
discovered while running stress.sh overnight, it complained about files not
matching but when they were checked the files matched.  Dropping the entire
series made stress.sh run fine.  So I''m dropping these for the next
merge window
but I''ll dig into it and try and figure out what was causing the
problem.
Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2013-Jul-01 13:18 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Quoting Josef Bacik (2013-07-01 08:54:35)> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:
> > Hi gang,
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don''t manufacture a fake f_pos, I *think* we can abuse
f_version as an
> > indicator that we shouldn''t return entries.  Does this look
reasonable?
> > 
> > We still have the problem that we can generate valid large f_pos
values
> > that can confuse 32bit userspace, but that''s a different
problem.  I
> > think we''ll want filldir generation of EOVERFLOW like what
exists for
> > large inodes. 
> > 
> > The rest of the patches are cleanups that I saw when absorbing the
> > code.  It''s all lightly tested with xfstests but it
wouldn''t surprise
> > me if I missed something so review is appreciated.
> > 
> > Thanks!
> > 
> 
> One of these patches is making new entries not show up in readdir.  This
was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I''m dropping these for the
next merge window
> but I''ll dig into it and try and figure out what was causing the
problem.
Unfortunately I''ve only triggered this on flash, and the run takes
about
two hours to trigger.  Trying now with some extra printks to see if I
can nail it down

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jul-01 16:10 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

> > code.  It''s all lightly tested with xfstests but it
wouldn''t surprise
> > me if I missed something so review is appreciated.
*mmm, hmmm*
> One of these patches is making new entries not show up in readdir.  This
was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I''m dropping these for the
next merge window
> but I''ll dig into it and try and figure out what was causing the
problem.
Nerts.  It''s got to be the delayed inode stuff.

Maybe it''s some unlink/recreate pattern?  Is this a thing that
stress.sh
does?  (Where''s stress.sh live?)

- z 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2013-Jul-01 17:18 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Quoting Zach Brown (2013-07-01 12:10:02)> > > code.  It''s all lightly tested with xfstests but it
wouldn''t surprise
> > > me if I missed something so review is appreciated.
> 
> *mmm, hmmm*
> 
> > One of these patches is making new entries not show up in readdir. 
This was
> > discovered while running stress.sh overnight, it complained about
files not
> > matching but when they were checked the files matched.  Dropping the
entire
> > series made stress.sh run fine.  So I''m dropping these for
the next merge window
> > but I''ll dig into it and try and figure out what was causing
the problem.
> 
> Nerts.  It''s got to be the delayed inode stuff.
> 
> Maybe it''s some unlink/recreate pattern?  Is this a thing that
stress.sh
> does?  (Where''s stress.sh live?)
It''s an old namesys tool, I''ve copied it here:

http://masoncoding.com/mason/tools/stress.sh

My command line:

stress.sh -n 50 -s -c /build/linux /mnt

Basically its:

1) Make a list of all files in /build/linux and their md5sums
2) Start 50 procs
3) Each proc copies /build/linux into /mnt/stress/proc_num/
4) Each proc compares the md5sums of its private copy with the original
master
5) Each proc deletes the private copy
6) Repeat steps 2-5 forever.

The most likely cause of the bug I''m seeing is readdir not finding new
files.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jul-11 23:19 UTC

head link

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

On Mon, Jul 01, 2013 at 08:54:35AM -0400, Josef Bacik
wrote:> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don''t manufacture a fake f_pos, I *think* we can abuse
f_version as an
> > indicator that we shouldn''t return entries.  Does this look
reasonable?
>
> One of these patches is making new entries not show up in readdir.  This
was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I''m dropping these for the
next merge window
> but I''ll dig into it and try and figure out what was causing the
problem.
OK, how about this.

First, just drop the series.  Most of it was opportunistic cleanups that
I saw as I was reading the code.  But it certainly isn''t a
comprehensive
cleanup.  We''d still want to go back later and fix how inefficient the
delayed item data structures are for readdir.  So from some perspective
it''s just risky churn with little upside.

And I think I found a much simpler way to stop the current readdir from
looping instead of mucking around with f_version.  Just use LLONG_MAX if
entries have already overflowed INT_MAX.

We''d still want real freed offset reuse some day, but that''s a
bunch of
work that''ll have to be done very carefully.  This will at least stop
64bit apps from failing with offsets past 64bits and is very low risk.

So just add this patch and forget about the rest of the series?

It''ll still technically conflict with the s/filp->f_pos/ctx->pos/
in the
readdir interface change in -next but that fixup is trivial.

- z

From e295deb0f56f846738ee3dec63fe0350f3952503 Mon Sep 17 00:00:00 2001
From: Zach Brown <zab@redhat.com>
Date: Wed, 10 Jul 2013 19:48:51 -0400
Subject: [PATCH] btrfs: don''t loop on large offsets in readdir

When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX.  It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let''s set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t.   Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c157482..8583e8d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5186,14 +5186,31 @@ next:
 	}

 	/* Reached end of directory/root. Bump pos past the last item. */
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		/*
-		 * 32-bit glibc will use getdents64, but then strtol -
-		 * so the last number we can serve is this.
-		 */
-		filp->f_pos = 0x7fffffff;
-	else
-		filp->f_pos++;
+	filp->f_pos++;
+
+	/* 
+	 * Stop new entries from being returned after we return the last
+	 * entry.
+	 *
+	 * New directory entries are assigned a strictly increasing
+	 * offset.  This means that new entries created during readdir
+	 * are *guaranteed* to be seen in the future by that readdir.
+	 * This has broken buggy programs which operate on names as
+	 * they''re returned by readdir.  Until we re-use freed offsets
+	 * we have this hack to stop new entries from being returned
+	 * under the assumption that they''ll never reach this huge
+	 * offset.
+	 *
+	 * This is being careful not to overflow 32bit loff_t unless the
+	 * last entry requires it because doing so has broken 32bit apps
+	 * in the past.
+	 */
+	if (key_type == BTRFS_DIR_INDEX_KEY) {
+		if (filp->f_pos >= INT_MAX)
+			filp->f_pos = LLONG_MAX;
+		else
+			filp->f_pos = INT_MAX;
+	}
 nopos:
 	ret = 0;
 err:
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Jun 2013 - [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

[PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

[PATCH 1/6] btrfs: set readdir f_pos only after filldir

[PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX

[PATCH 3/6] btrfs: trivial delayed item readdir list cleanups

[PATCH 4/6] btrfs: simplify finding next/prev delayed items

[PATCH 5/6] btrfs: add helper to get delayed item root

[PATCH 6/6] btrfs: get fewer delayed item refs during readdir

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 1/6] btrfs: set readdir f_pos only after filldir

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups

Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups