thr3ads.net - Btrfs devel - [PATCH 0/6] Btrfs commit fixes, async subvol operations [Oct 2010]

If this information is useful, please help other people find it:
Share via:

Sage Weil

2010-Oct-25 19:07 UTC

[PATCH 0/6] Btrfs commit fixes, async subvol operations

Hi Chris,

This is the extent of my current queue of Btrfs snapshot/subvol/commit 
stuff. Most of these were posted several months ago.  Can be sent 
upstream during this merge window?  Not having this functionality is 
becoming a bit of a roadblock for our efforts to keep the Ceph data in a 
consistent state.

These patches are also available from

	git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls

The first patch is strictly a bug fix for a deadlock in 
btrfs_commit_transaction().

The next few patches are a repost (with a few minor revisions) of the 
async snapshot/subvolume ioctls I originally posted last spring.  They 
include:

 - Some async commit helper functions
 - Start and wait sync ioctls, for initiating and waiting for a sync
 - An ioctl to start a snapshot creation asynchronously, such that you 
   don''t have to wait for the full commit to disk.  The interface is
cleaned
   up a bit from the original version.

There is also a patch that changes the SNAP_DESTROY ioctl to not do a 
commit before returning.  The rationale here is there is no obvious 
reason (to me at least) why the snapshot removal should be on disk 
before returning; rm(2) and unlink(2) certainly don''t do that.  If the 
user disagrees, they can call sync(2).  If you would prefer, I also have 
a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn''t 
change any existing behavior.

The last item is a change to SNAP_DESTROY to allow deletion of a 
snapshot when the user owns the subvol''s root inode and the parent 
directory permissions are such that we would have allowed an rmdir(2).  
Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
semantics completely (except for the empty directory check) by 
duplicating some VFS code.  Whether we want weaker semantics, duplicated 
code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
is distinct from a similar patch (also from Goffredo) that allows 
rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
subvol to be deleted by a non-root user.  As long as I can do that, my 
daemon doesn''t have to run as root and I''m a happy camper.  :)

If anybody has any questions or issues with any of this, please let me 
know so I can revise the patches accordingly.

Thanks!
sage

---


Sage Weil (6):
  Btrfs: fix deadlock in btrfs_commit_transaction
  Btrfs: async transaction commit
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: make SNAP_DESTROY async
  Btrfs: allow subvol deletion by owner

 fs/btrfs/ctree.h       |    1 +
 fs/btrfs/disk-io.c     |    1 +
 fs/btrfs/ioctl.c       |  152 +++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ioctl.h       |    9 +++
 fs/btrfs/transaction.c |  183 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/transaction.h |    4 +
 security/security.c    |    1 +
 7 files changed, 326 insertions(+), 25 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it''s possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/transaction.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 66e4c66..bf399ea 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle
*trans,
 			     struct btrfs_root *root)
 {
 	unsigned long joined = 0;
-	unsigned long timeout = 1;
 	struct btrfs_transaction *cur_trans;
 	struct btrfs_transaction *prev_trans = NULL;
 	DEFINE_WAIT(wait);
@@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle
*trans,
 			snap_pending = 1;
 
 		WARN_ON(cur_trans != trans->transaction);
-		if (cur_trans->num_writers > 1)
-			timeout = MAX_SCHEDULE_TIMEOUT;
-		else if (should_grow)
-			timeout = 1;
-
 		mutex_unlock(&root->fs_info->trans_mutex);
 
 		if (flush_on_commit || snap_pending) {
@@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle
*trans,
 				TASK_UNINTERRUPTIBLE);
 
 		smp_mb();
-		if (cur_trans->num_writers > 1 || should_grow)
-			schedule_timeout(timeout);
+		if (cur_trans->num_writers > 1)
+			schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+		else if (should_grow)
+			schedule_timeout(1);
 
 		mutex_lock(&root->fs_info->trans_mutex);
 		finish_wait(&cur_trans->writer_wait, &wait);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 2/6] Btrfs: async transaction commit

Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the
''middle'' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ctree.h       |    1 +
 fs/btrfs/disk-io.c     |    1 +
 fs/btrfs/transaction.c |  119 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/transaction.h |    3 +
 4 files changed, 124 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eaf286a..4c3c20b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -863,6 +863,7 @@ struct btrfs_fs_info {
 	struct btrfs_transaction *running_transaction;
 	wait_queue_head_t transaction_throttle;
 	wait_queue_head_t transaction_wait;
+	wait_queue_head_t transaction_blocked_wait;
 	wait_queue_head_t async_submit_wait;
 
 	struct btrfs_super_block super_copy;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..034abc6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1680,6 +1680,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
 	init_waitqueue_head(&fs_info->transaction_throttle);
 	init_waitqueue_head(&fs_info->transaction_wait);
+	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
 	__setup_root(4096, 4096, 4096, 4096, tree_root,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bf399ea..d720106 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -988,6 +988,123 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info)
 	return ret;
 }
 
+/*
+ * wait for the current transaction commit to start and block subsequent
+ * transaction joins
+ */
+static void wait_current_trans_commit_start(struct btrfs_root *root,
+					    struct btrfs_transaction *trans)
+{
+	DEFINE_WAIT(wait);
+
+	if (trans->in_commit)
+		return;
+
+	while (1) {
+		prepare_to_wait(&root->fs_info->transaction_blocked_wait,
&wait,
+				TASK_UNINTERRUPTIBLE);
+		if (trans->in_commit) {
+			finish_wait(&root->fs_info->transaction_blocked_wait,
+				    &wait);
+			break;
+		}
+		mutex_unlock(&root->fs_info->trans_mutex);
+		schedule();
+		mutex_lock(&root->fs_info->trans_mutex);
+		finish_wait(&root->fs_info->transaction_blocked_wait, &wait);
+	}
+}
+
+/*
+ * wait for the current transaction to start and then become unblocked.
+ * caller holds ref.
+ */
+static void wait_current_trans_commit_start_and_unblock(struct btrfs_root
*root,
+					 struct btrfs_transaction *trans)
+{
+	DEFINE_WAIT(wait);
+
+	if (trans->commit_done || (trans->in_commit &&
!trans->blocked))
+		return;
+
+	while (1) {
+		prepare_to_wait(&root->fs_info->transaction_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		if (trans->commit_done ||
+		    (trans->in_commit && !trans->blocked)) {
+			finish_wait(&root->fs_info->transaction_wait,
+				    &wait);
+			break;
+		}
+		mutex_unlock(&root->fs_info->trans_mutex);
+		schedule();
+		mutex_lock(&root->fs_info->trans_mutex);
+		finish_wait(&root->fs_info->transaction_wait,
+			    &wait);
+	}
+}
+
+/*
+ * commit transactions asynchronously. once btrfs_commit_transaction_async
+ * returns, any subsequent transaction will not be allowed to join.
+ */
+struct btrfs_async_commit {
+	struct btrfs_trans_handle *newtrans;
+	struct btrfs_root *root;
+	struct delayed_work work;
+};
+
+static void do_async_commit(struct work_struct *work)
+{
+	struct btrfs_async_commit *ac +		container_of(work, struct btrfs_async_commit,
work.work);
+	
+	btrfs_commit_transaction(ac->newtrans, ac->root);
+	kfree(ac);
+}
+
+int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *root,
+				   int wait_for_unblock)
+{
+	struct btrfs_async_commit *ac;
+	struct btrfs_transaction *cur_trans;
+
+	ac = kmalloc(sizeof(*ac), GFP_NOFS);
+	BUG_ON(!ac);
+
+	INIT_DELAYED_WORK(&ac->work, do_async_commit);
+	ac->root = root;
+	ac->newtrans = btrfs_join_transaction(root, 0);
+
+	/* take transaction reference */
+	mutex_lock(&root->fs_info->trans_mutex);
+	cur_trans = trans->transaction;
+	cur_trans->use_count++;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	btrfs_end_transaction(trans, root);
+	schedule_delayed_work(&ac->work, 0);
+
+	/* wait for transaction to start and unblock */
+	mutex_lock(&root->fs_info->trans_mutex);
+	if (wait_for_unblock)
+		wait_current_trans_commit_start_and_unblock(root, cur_trans);
+	else
+		wait_current_trans_commit_start(root, cur_trans);
+	put_transaction(cur_trans);
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	return 0;
+}
+
+/*
+ * btrfs_transaction state sequence:
+ *    in_commit = 0, blocked = 0  (initial)
+ *    in_commit = 1, blocked = 1
+ *    blocked = 0
+ *    commit_done = 1
+ */
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
@@ -1038,6 +1155,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle
*trans,
 
 	trans->transaction->in_commit = 1;
 	trans->transaction->blocked = 1;
+	wake_up(&root->fs_info->transaction_blocked_wait);
+
 	if (cur_trans->list.prev != &root->fs_info->trans_list) {
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index e104986..8972c62 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -104,6 +104,9 @@ int btrfs_defrag_root(struct btrfs_root *root, int
cacheonly);
 int btrfs_clean_old_snapshots(struct btrfs_root *root);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root);
+int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *root,
+				   int wait_for_unblock);
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root);
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans,
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls

START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn''t exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it''s commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn''t wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c       |   34 +++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h       |    2 +
 fs/btrfs/transaction.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/transaction.h |    1 +
 4 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..6306051 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1984,6 +1984,36 @@ long btrfs_ioctl_trans_end(struct file *file)
 	return 0;
 }
 
+static noinline long btrfs_ioctl_start_sync(struct file *file, void __user
*argp)
+{
+	struct btrfs_root *root = BTRFS_I(file->f_dentry->d_inode)->root;
+	struct btrfs_trans_handle *trans;
+	u64 transid;
+
+	trans = btrfs_start_transaction(root, 0);
+	transid = trans->transid;
+	btrfs_commit_transaction_async(trans, root, 0);
+
+	if (argp)
+		if (copy_to_user(argp, &transid, sizeof(transid)))
+			return -EFAULT;
+	return 0;
+}
+
+static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user
*argp)
+{
+	struct btrfs_root *root = BTRFS_I(file->f_dentry->d_inode)->root;
+	u64 transid;
+
+	if (argp) {
+		if (copy_from_user(&transid, argp, sizeof(transid)))
+			return -EFAULT;
+	} else {
+		transid = 0;  /* current trans */
+	}
+	return btrfs_wait_for_commit(root, transid);
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2034,6 +2064,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SYNC:
 		btrfs_sync_fs(file->f_dentry->d_sb, 1);
 		return 0;
+	case BTRFS_IOC_START_SYNC:
+		return btrfs_ioctl_start_sync(file, argp);
+	case BTRFS_IOC_WAIT_SYNC:
+		return btrfs_ioctl_wait_sync(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..e9a2f7e 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -178,4 +178,6 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
 				    struct btrfs_ioctl_space_args)
+#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64)
+#define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d720106..932bc03 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -270,6 +270,58 @@ static noinline int wait_for_commit(struct btrfs_root
*root,
 	return 0;
 }
 
+int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+{
+	struct btrfs_transaction *cur_trans = NULL, *t;
+	int ret;
+
+	mutex_lock(&root->fs_info->trans_mutex);
+
+	ret = 0;
+	if (transid) {
+		if (transid <= root->fs_info->last_trans_committed)
+			goto out_unlock;
+
+		/* find specified transaction */
+		list_for_each_entry(t, &root->fs_info->trans_list, list) {
+			if (t->transid == transid) {
+				cur_trans = t;
+				break;
+			}
+			if (t->transid > transid)
+				break;
+		}
+		ret = -EINVAL;
+		if (!cur_trans)
+			goto out_unlock;  /* bad transid */
+	} else {
+		/* find newest transaction that is committing | committed */
+		list_for_each_entry_reverse(t, &root->fs_info->trans_list,
+					    list) {
+			if (t->in_commit) {
+				if (t->commit_done)
+					goto out_unlock;
+				cur_trans = t;
+				break;
+			}
+		}
+		if (!cur_trans)
+			goto out_unlock;  /* nothing committing|committed */
+	}
+
+	cur_trans->use_count++;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	wait_for_commit(root, cur_trans);
+
+	mutex_lock(&root->fs_info->trans_mutex);
+	put_transaction(cur_trans);
+	ret = 0;
+out_unlock:
+	mutex_unlock(&root->fs_info->trans_mutex);
+	return ret;
+}
+
 #if 0
 /*
  * rate limit against the drop_snapshot code.  This helps to slow down new
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 8972c62..33dec24 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -93,6 +93,7 @@ struct btrfs_trans_handle *btrfs_join_transaction(struct
btrfs_root *root,
 						  int num_blocks);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *r,
 							 int num_blocks);
+int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
 int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root);
 int btrfs_commit_tree_roots(struct btrfs_trans_handle *trans,
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 4/6] Btrfs: add SNAP_CREATE_ASYNC ioctl

Create a snap without waiting for it to commit to disk.  The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.

We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/ioctl.h |    7 ++++
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6306051..679b8a8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -224,7 +224,8 @@ static int btrfs_ioctl_getversion(struct file *file, int
__user *arg)
 
 static noinline int create_subvol(struct btrfs_root *root,
 				  struct dentry *dentry,
-				  char *name, int namelen)
+				  char *name, int namelen,
+				  u64 *async_transid)
 {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_key key;
@@ -338,13 +339,19 @@ static noinline int create_subvol(struct btrfs_root *root,
 
 	d_instantiate(dentry, btrfs_lookup_dentry(dir, dentry));
 fail:
-	err = btrfs_commit_transaction(trans, root);
+	if (async_transid) {
+		*async_transid = trans->transid;
+		err = btrfs_commit_transaction_async(trans, root, 1);
+	} else {
+		err = btrfs_commit_transaction(trans, root);
+	}
 	if (err && !ret)
 		ret = err;
 	return ret;
 }
 
-static int create_snapshot(struct btrfs_root *root, struct dentry *dentry)
+static int create_snapshot(struct btrfs_root *root, struct dentry *dentry,
+			   char *name, int namelen, u64 *async_transid)
 {
 	struct inode *inode;
 	struct btrfs_pending_snapshot *pending_snapshot;
@@ -373,7 +380,14 @@ static int create_snapshot(struct btrfs_root *root, struct
dentry *dentry)
 
 	list_add(&pending_snapshot->list,
 		 &trans->transaction->pending_snapshots);
-	ret = btrfs_commit_transaction(trans, root->fs_info->extent_root);
+	if (async_transid) {
+		*async_transid = trans->transid;
+		ret = btrfs_commit_transaction_async(trans,
+				     root->fs_info->extent_root, 1);
+	} else {
+		ret = btrfs_commit_transaction(trans,
+					       root->fs_info->extent_root);
+	}
 	BUG_ON(ret);
 
 	ret = pending_snapshot->error;
@@ -412,7 +426,8 @@ static inline int btrfs_may_create(struct inode *dir, struct
dentry *child)
  */
 static noinline int btrfs_mksubvol(struct path *parent,
 				   char *name, int namelen,
-				   struct btrfs_root *snap_src)
+				   struct btrfs_root *snap_src,
+				   u64 *async_transid)
 {
 	struct inode *dir  = parent->dentry->d_inode;
 	struct dentry *dentry;
@@ -443,10 +458,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 		goto out_up_read;
 
 	if (snap_src) {
-		error = create_snapshot(snap_src, dentry);
+		error = create_snapshot(snap_src, dentry,
+					name, namelen, async_transid);
 	} else {
 		error = create_subvol(BTRFS_I(dir)->root, dentry,
-				      name, namelen);
+				      name, namelen, async_transid);
 	}
 	if (!error)
 		fsnotify_mkdir(dir, dentry);
@@ -801,8 +817,10 @@ out_unlock:
 	return ret;
 }
 
-static noinline int btrfs_ioctl_snap_create(struct file *file,
-					    void __user *arg, int subvol)
+static noinline int btrfs_ioctl_snap_create_transid(struct file *file,
+						    void __user *arg,
+						    int subvol,
+						    u64 *transid)
 {
 	struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
 	struct btrfs_ioctl_vol_args *vol_args;
@@ -826,7 +844,7 @@ static noinline int btrfs_ioctl_snap_create(struct file
*file,
 
 	if (subvol) {
 		ret = btrfs_mksubvol(&file->f_path, vol_args->name, namelen,
-				     NULL);
+				     NULL, transid);
 	} else {
 		struct inode *src_inode;
 		src_file = fget(vol_args->fd);
@@ -844,7 +862,8 @@ static noinline int btrfs_ioctl_snap_create(struct file
*file,
 			goto out;
 		}
 		ret = btrfs_mksubvol(&file->f_path, vol_args->name, namelen,
-				     BTRFS_I(src_inode)->root);
+				     BTRFS_I(src_inode)->root,
+				     transid);
 		fput(src_file);
 	}
 out:
@@ -852,6 +871,48 @@ out:
 	return ret;
 }
 
+static noinline int btrfs_async_vol_args(void __user *arg,
+			     struct btrfs_ioctl_vol_args **vol_args,
+			     u64 __user **ptransid)
+{
+	struct btrfs_ioctl_async_vol_args *async_args;
+
+	async_args = memdup_user(arg, sizeof(*async_args));
+	if (IS_ERR(async_args))
+		return PTR_ERR(async_args);
+	*vol_args = async_args->args;
+	*ptransid = async_args->transid;
+	kfree(async_args);
+	return 0;
+}
+
+static noinline int btrfs_ioctl_snap_create(struct file *file,
+					    void __user *arg, int subvol,
+					    int async)
+{
+	struct btrfs_ioctl_vol_args *vol_args = arg;
+	u64 __user *ptransid = NULL;
+	u64 transid = 0;
+	int ret;
+
+	if (async) {
+		ret = btrfs_async_vol_args(arg, &vol_args, &ptransid);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = btrfs_ioctl_snap_create_transid(file, vol_args,
+					      subvol, &transid);
+
+	if (!ret && async && ptransid) {
+		if (copy_to_user(ptransid, &transid,
+				 sizeof(transid)))
+			return -EFAULT;
+	}
+
+	return ret;
+}
+
 /*
  * helper to check if the subvolume references other subvolumes
  */
@@ -2028,9 +2089,11 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case FS_IOC_GETVERSION:
 		return btrfs_ioctl_getversion(file, argp);
 	case BTRFS_IOC_SNAP_CREATE:
-		return btrfs_ioctl_snap_create(file, argp, 0);
+		return btrfs_ioctl_snap_create(file, argp, 0, 0);
+	case BTRFS_IOC_SNAP_CREATE_ASYNC:
+		return btrfs_ioctl_snap_create(file, argp, 0, 1);
 	case BTRFS_IOC_SUBVOL_CREATE:
-		return btrfs_ioctl_snap_create(file, argp, 1);
+		return btrfs_ioctl_snap_create(file, argp, 1, 0);
 	case BTRFS_IOC_SNAP_DESTROY:
 		return btrfs_ioctl_snap_destroy(file, argp);
 	case BTRFS_IOC_DEFAULT_SUBVOL:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index e9a2f7e..8f28626 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -30,6 +30,11 @@ struct btrfs_ioctl_vol_args {
 	char name[BTRFS_PATH_NAME_MAX + 1];
 };
 
+struct btrfs_ioctl_async_vol_args {
+	struct btrfs_ioctl_vol_args *args;
+	__u64 *transid;
+};
+
 #define BTRFS_INO_LOOKUP_PATH_MAX 4080
 struct btrfs_ioctl_ino_lookup_args {
 	__u64 treeid;
@@ -180,4 +185,6 @@ struct btrfs_ioctl_space_args {
 				    struct btrfs_ioctl_space_args)
 #define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64)
 #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
+#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
+				   struct btrfs_ioctl_async_vol_args)
 #endif
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 5/6] Btrfs: make SNAP_DESTROY async

There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call
''sync''.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 679b8a8..9cbda86 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1365,7 +1365,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file
*file,
 		BUG_ON(ret);
 	}
 
-	ret = btrfs_commit_transaction(trans, root);
+	ret = btrfs_end_transaction(trans, root);
 	BUG_ON(ret);
 	inode->i_flags |= S_DEAD;
 out_up_write:
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:07 UTC

head link

[PATCH 6/6] Btrfs: allow subvol deletion by owner

Allow users to delete the snapshots/subvols they own.  When CAP_SYS_ADMIN
is not present, require that

 - the user has write and exec permission on the parent directory
 - security_inode_rmdir() doesn''t object
 - the user has write and exec permission on the subvol directory
 - the user owns the subvol directory
 - the directory and subvol append flags are not set

This is a bit more strict than the requirements for ''rm -f
subvol/*'',
which is allowed with just ''wx'' on non-owned non-sticky dirs. 
It is less
strict than ''rmdir subvol'', which has additional requirements
if the parent
directory is sticky.  It is less strict than ''rm -rf subvol'',
which would
require scanning the entire subvol to ensure all content is deletable.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c    |   27 ++++++++++++++++++++++++---
 security/security.c |    1 +
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9cbda86..90d2871 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1288,9 +1288,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file
*file,
 	int ret;
 	int err = 0;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args))
 		return PTR_ERR(vol_args);
@@ -1325,6 +1322,30 @@ static noinline int btrfs_ioctl_snap_destroy(struct file
*file,
 		goto out_dput;
 	}
 
+	/*
+	 * Allow subvol deletion if we own the subvol and we would
+	 * (approximately) be allowed to rmdir it.  Strictly speaking,
+	 * we could possibly delete everything with fewer permissions,
+	 * or might require more permissions to remove all files
+	 * contained by the subvol, but we aren''t trying to mimic
+	 * directory semantics perfectly.
+	 */
+	if (!capable(CAP_SYS_ADMIN)) {
+		err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+		if (err)
+			goto out_dput;
+		err = security_inode_rmdir(dir, dentry);
+		if (err)
+			goto out_dput;
+		err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
+		if (err)
+			goto out_dput;
+		err = -EPERM;
+		if (inode->i_uid != current_fsuid() || IS_APPEND(dir) ||
+		    IS_APPEND(inode))
+			goto out_dput;
+	}
+
 	dest = BTRFS_I(inode)->root;
 
 	mutex_lock(&inode->i_mutex);
diff --git a/security/security.c b/security/security.c
index c53949f..1c980ee 100644
--- a/security/security.c
+++ b/security/security.c
@@ -490,6 +490,7 @@ int security_inode_rmdir(struct inode *dir, struct dentry
*dentry)
 		return 0;
 	return security_ops->inode_rmdir(dir, dentry);
 }
+EXPORT_SYMBOL_GPL(security_inode_rmdir);
 
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode,
dev_t dev)
 {
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Oct-25 19:29 UTC

head link

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

On Mon, Oct 25, 2010 at 12:07:36PM -0700, Sage Weil
wrote:> Hi Chris,
> 
> This is the extent of my current queue of Btrfs snapshot/subvol/commit 
> stuff. Most of these were posted several months ago.  Can be sent 
> upstream during this merge window?  Not having this functionality is 
> becoming a bit of a roadblock for our efforts to keep the Ceph data in a 
> consistent state.
> 
> These patches are also available from
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls
> 
> The first patch is strictly a bug fix for a deadlock in 
> btrfs_commit_transaction().
> 
> The next few patches are a repost (with a few minor revisions) of the 
> async snapshot/subvolume ioctls I originally posted last spring.  They 
> include:
> 
>  - Some async commit helper functions
>  - Start and wait sync ioctls, for initiating and waiting for a sync
>  - An ioctl to start a snapshot creation asynchronously, such that you 
>    don''t have to wait for the full commit to disk.  The interface
is cleaned
>    up a bit from the original version.
> 
> There is also a patch that changes the SNAP_DESTROY ioctl to not do a 
> commit before returning.  The rationale here is there is no obvious 
> reason (to me at least) why the snapshot removal should be on disk 
> before returning; rm(2) and unlink(2) certainly don''t do that.  If
the
> user disagrees, they can call sync(2).  If you would prefer, I also have 
> a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that
doesn''t
> change any existing behavior.
These all look good to me and I''m pulling them in.
> 
> The last item is a change to SNAP_DESTROY to allow deletion of a 
> snapshot when the user owns the subvol''s root inode and the parent
> directory permissions are such that we would have allowed an rmdir(2).  
> Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
> semantics completely (except for the empty directory check) by 
> duplicating some VFS code.  Whether we want weaker semantics, duplicated 
> code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
> is distinct from a similar patch (also from Goffredo) that allows 
> rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
> subvol to be deleted by a non-root user.  As long as I can do that, my 
> daemon doesn''t have to run as root and I''m a happy
camper.  :)
Someone at the storage workshop mentioned that this subvol deletion
trick is slightly stronger than rm -rf, to make it include the same
level of permission checks would require testing all the directories in
the tree for permissions.

For now, could you please make a mount -o user_subvol_rm_allowed option?
(or something similar with a better name).

Thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 19:41 UTC

head link

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

On Mon, 25 Oct 2010, Chris Mason wrote:> These all look good to me and I''m pulling them in.
Great, thanks!
> > The last item is a change to SNAP_DESTROY to allow deletion of a 
> > snapshot when the user owns the subvol''s root inode and the
parent
> > directory permissions are such that we would have allowed an rmdir(2).
> > Goffredo Baroncelli posted a similar patch that replicates the
rmdir(2)
> > semantics completely (except for the empty directory check) by 
> > duplicating some VFS code.  Whether we want weaker semantics,
duplicated
> > code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this
> > is distinct from a similar patch (also from Goffredo) that allows 
> > rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
> > subvol to be deleted by a non-root user.  As long as I can do that, my
> > daemon doesn''t have to run as root and I''m a happy
camper.  :)
> 
> Someone at the storage workshop mentioned that this subvol deletion
> trick is slightly stronger than rm -rf, to make it include the same
> level of permission checks would require testing all the directories in
> the tree for permissions.
I think that was me :)
 > For now, could you please make a mount -o user_subvol_rm_allowed option?
> (or something similar with a better name).
Sure.  

Do you have a preference as far as what checks are implemented?  My patch 
implemented a simplified approximation of may_rmdir(); Goffredo''s 
duplicated the vfs checks.  I guess I''m leaning toward the latter...

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Oct-25 19:58 UTC

head link

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

On Mon, Oct 25, 2010 at 12:41:46PM -0700, Sage Weil
wrote:> On Mon, 25 Oct 2010, Chris Mason wrote:
> > These all look good to me and I''m pulling them in.
> 
> Great, thanks!
> 
> > > The last item is a change to SNAP_DESTROY to allow deletion of a 
> > > snapshot when the user owns the subvol''s root inode and
the parent
> > > directory permissions are such that we would have allowed an
rmdir(2).
> > > Goffredo Baroncelli posted a similar patch that replicates the
rmdir(2)
> > > semantics completely (except for the empty directory check) by 
> > > duplicating some VFS code.  Whether we want weaker semantics,
duplicated
> > > code, or some new EXPORT_SYMBOLS is up to you I think.  Note that
this
> > > is distinct from a similar patch (also from Goffredo) that allows
> > > rmdir(2) to remove an empty subvol; my goal is to allow a
non-empty
> > > subvol to be deleted by a non-root user.  As long as I can do
that, my
> > > daemon doesn''t have to run as root and I''m a
happy camper.  :)
> > 
> > Someone at the storage workshop mentioned that this subvol deletion
> > trick is slightly stronger than rm -rf, to make it include the same
> > level of permission checks would require testing all the directories
in
> > the tree for permissions.
> 
> I think that was me :)
Grin, two different people then ;)
>  
> > For now, could you please make a mount -o user_subvol_rm_allowed
option?
> > (or something similar with a better name).
> 
> Sure.  
> 
> Do you have a preference as far as what checks are implemented?  My patch 
> implemented a simplified approximation of may_rmdir(); Goffredo''s 
> duplicated the vfs checks.  I guess I''m leaning toward the
latter...
Yes, lets duplicate the vfs checks.  Christoph just sat bolt upright in
whatever ski lift he''s currently riding.

We should also make sure they do the subvol rm against the root of the
subvol (if that check isn''t already done), none of the magic to resolve
the subvol based on any file inside it.  I don''t want people to
accidentally think they are deleting a subdir and have it go higher up
into the directory tree.

Oh, and it shouldn''t work on the root of the FS either ;)

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-25 21:27 UTC

head link

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

On Mon, 25 Oct 2010, Chris Mason wrote:> Yes, lets duplicate the vfs checks.  Christoph just sat bolt upright in
> whatever ski lift he''s currently riding.
:)

This version is based on Goffredo''s patch, but requires the 
user_subvol_rm_allowed mount option, and will proceed even if the subvol 
is nonempty.  (I suspect we also want his other rmdir(2) patch at some 
point as well so that users can rmdir an empty subvol and/or ''rm
-r''.)
> We should also make sure they do the subvol rm against the root of the
> subvol (if that check isn''t already done), none of the magic to
resolve
> the subvol based on any file inside it.  I don''t want people to
> accidentally think they are deleting a subdir and have it go higher up
> into the directory tree.
> 
> Oh, and it shouldn''t work on the root of the FS either ;)
Since the ioctl is based on a name lookup, I added a check that the parent 
inode''s root isn''t the same as the target root.  That implies
the ioctl is
called the dentry referencing the subvol, and presumably the root
doesn''t
have one of those.  Does that cover it?  It isn''t possible to have the 
subvol bound to multiple locations in the namespace, right?

$ whoami
sage
$ btrfs sub create a
Create subvolume ''./a''
$ btrfs sub create a/b
Create subvolume ''a/b''
$ btrfs sub snap a asnap
Create a snapshot of ''a'' in ''./asnap''
$ btrfs sub del asnap/b 
ERROR: ''asnap/b'' is not a subvolume
$ btrfs sub del a
Delete subvolume ''/mnt/osd27/a/a''
ERROR: cannot delete ''/mnt/osd27/a/a''
$ btrfs sub del a/b
Delete subvolume ''/mnt/osd27/a/a/b''
$ btrfs sub del a  
Delete subvolume ''/mnt/osd27/a/a''
$ btrfs sub del asnap
Delete subvolume ''/mnt/osd27/a/asnap''

Sending the patch separately.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

liubo

2010-Oct-26 06:46 UTC

head link

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

On 10/26/2010 03:07 AM, Sage Weil wrote:> We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
> num_writers > 1 or should_grow at the top of the loop.  Then, much much
> later, we wait for that timeout if either num_writers or should_grow is
> true.  However, it''s possible for a racing process (calling
> btrfs_end_transaction()) to decrement num_writers such that we wait
> forever instead of for 1.
> 
IMO, there still exists a deadlock with your patch.
==with your patch:

thread 1:                          thread 2:                       

btrfs_commit_transaction()
  if (num_writers > 1)
    timeout = MAX_TIMEOUT;
	                --------->
                                   __btrfs_end_transaction()
                                     num_writers--;
				     if (wq)  
                                       wake_up();
			<---------
  smp_mb();
  prepare_wait();
  if (num_writers > 1)
    schedule_timeout(MAX);
  else if (should_grow)
    schedule_timeout(1);
==
thread2 also needs a memory_barrier, for without memory_barrier, 
on some CPUs, "if (wq)" may be executed before num_writers--, like
==thread 1:                          thread 2:                       

btrfs_commit_transaction()
  if (num_writers > 1)
    timeout = MAX_TIMEOUT;
	                --------->
                                   __btrfs_end_transaction()
                                     if (wq)  
                                       wake_up();
			<---------
  smp_mb();
  prepare_wait();
  if (num_writers > 1)
    schedule_timeout(MAX);
  else if (should_grow)
    schedule_timeout(1); 
			---------->
				      num_writers--;
==then, thread1 may wait forever.

Since wake_up() itself provides a implied wmb, and a wq active check, 
it is better to drop "if (wq)" in __btrfs_end_transaction().


thanks,
liubo
> Fix this by deciding how long to wait when we wait.
> 
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
>  fs/btrfs/transaction.c |   12 ++++--------
>  1 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 66e4c66..bf399ea 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle
*trans,
>  			     struct btrfs_root *root)
>  {
>  	unsigned long joined = 0;
> -	unsigned long timeout = 1;
>  	struct btrfs_transaction *cur_trans;
>  	struct btrfs_transaction *prev_trans = NULL;
>  	DEFINE_WAIT(wait);
> @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct
btrfs_trans_handle *trans,
>  			snap_pending = 1;
>  
>  		WARN_ON(cur_trans != trans->transaction);
> -		if (cur_trans->num_writers > 1)
> -			timeout = MAX_SCHEDULE_TIMEOUT;
> -		else if (should_grow)
> -			timeout = 1;
> -
>  		mutex_unlock(&root->fs_info->trans_mutex);
>  
>  		if (flush_on_commit || snap_pending) {
> @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct
btrfs_trans_handle *trans,
>  				TASK_UNINTERRUPTIBLE);
>  
>  		smp_mb();
> -		if (cur_trans->num_writers > 1 || should_grow)
> -			schedule_timeout(timeout);
> +		if (cur_trans->num_writers > 1)
> +			schedule_timeout(MAX_SCHEDULE_TIMEOUT);
> +		else if (should_grow)
> +			schedule_timeout(1);
>  
>  		mutex_lock(&root->fs_info->trans_mutex);
>  		finish_wait(&cur_trans->writer_wait, &wait);



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil

2010-Oct-26 16:36 UTC

head link

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

On Tue, 26 Oct 2010, liubo wrote:> On 10/26/2010 03:07 AM, Sage Weil wrote:
> > We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on
whether
> > num_writers > 1 or should_grow at the top of the loop.  Then, much
much
> > later, we wait for that timeout if either num_writers or should_grow
is
> > true.  However, it''s possible for a racing process (calling
> > btrfs_end_transaction()) to decrement num_writers such that we wait
> > forever instead of for 1.
> > 
> 
> IMO, there still exists a deadlock with your patch.
> ==> with your patch:
> 
> thread 1:                          thread 2:                       
> 
> btrfs_commit_transaction()
>   if (num_writers > 1)
>     timeout = MAX_TIMEOUT;
(This bit goes away, btw.)> 	                --------->
>                                    __btrfs_end_transaction()
>                                      num_writers--;
> 				     if (wq)  
>                                        wake_up();
> 			<---------
>   smp_mb();
>   prepare_wait();
>   if (num_writers > 1)
>     schedule_timeout(MAX);
>   else if (should_grow)
>     schedule_timeout(1);
> ==What''s the problem above?  The wake_up() doesn''t get called,
and thread1
doesn''t sleep.
> thread2 also needs a memory_barrier, for without memory_barrier, 
> on some CPUs, "if (wq)" may be executed before num_writers--,
like
> ==> thread 1:                          thread 2:                       
> 
> btrfs_commit_transaction()
>   if (num_writers > 1)
>     timeout = MAX_TIMEOUT;(This bit is gone)
> 	                --------->
>                                    __btrfs_end_transaction()
>                                      if (wq)  
>                                        wake_up();
> 			<---------
>   smp_mb();
>   prepare_wait();
>   if (num_writers > 1)
>     schedule_timeout(MAX);
>   else if (should_grow)
>     schedule_timeout(1); 
> 			---------->
> 				      num_writers--;
> ==> then, thread1 may wait forever.
> 
> Since wake_up() itself provides a implied wmb, and a wq active check, 
> it is better to drop "if (wq)" in __btrfs_end_transaction().
I see.  It could also be

        smb_mb();
        if (wq)
                wake_up();

but just calling wake_up() unconditionally is simpler, and fewer barriers 
in the wake_up case.  I''m not attached to the if (wq); I just kept it 
because it was there already.

Chris?

Thanks!
sage




> 
> 
> thanks,
> liubo
> 
> > Fix this by deciding how long to wait when we wait.
> > 
> > Signed-off-by: Sage Weil <sage@newdream.net>
> > ---
> >  fs/btrfs/transaction.c |   12 ++++--------
> >  1 files changed, 4 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 66e4c66..bf399ea 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct
btrfs_trans_handle *trans,
> >  			     struct btrfs_root *root)
> >  {
> >  	unsigned long joined = 0;
> > -	unsigned long timeout = 1;
> >  	struct btrfs_transaction *cur_trans;
> >  	struct btrfs_transaction *prev_trans = NULL;
> >  	DEFINE_WAIT(wait);
> > @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct
btrfs_trans_handle *trans,
> >  			snap_pending = 1;
> >  
> >  		WARN_ON(cur_trans != trans->transaction);
> > -		if (cur_trans->num_writers > 1)
> > -			timeout = MAX_SCHEDULE_TIMEOUT;
> > -		else if (should_grow)
> > -			timeout = 1;
> > -
> >  		mutex_unlock(&root->fs_info->trans_mutex);
> >  
> >  		if (flush_on_commit || snap_pending) {
> > @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct
btrfs_trans_handle *trans,
> >  				TASK_UNINTERRUPTIBLE);
> >  
> >  		smp_mb();
> > -		if (cur_trans->num_writers > 1 || should_grow)
> > -			schedule_timeout(timeout);
> > +		if (cur_trans->num_writers > 1)
> > +			schedule_timeout(MAX_SCHEDULE_TIMEOUT);
> > +		else if (should_grow)
> > +			schedule_timeout(1);
> >  
> >  		mutex_lock(&root->fs_info->trans_mutex);
> >  		finish_wait(&cur_trans->writer_wait, &wait);
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Oct-26 17:06 UTC

head link

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

On Tue, Oct 26, 2010 at 09:36:26AM -0700, Sage Weil
wrote:> On Tue, 26 Oct 2010, liubo wrote:
> > 
> > Since wake_up() itself provides a implied wmb, and a wq active check, 
> > it is better to drop "if (wq)" in __btrfs_end_transaction().
> 
> I see.  It could also be
> 
>         smb_mb();
>         if (wq)
>                 wake_up();
> 
> but just calling wake_up() unconditionally is simpler, and fewer barriers 
> in the wake_up case.  I''m not attached to the if (wq); I just kept
it
> because it was there already.
wake_up() provides an implied barrier because it takes the lock.  I
usually do the smp_mb() + if (wq) dance when I''m working on a
relatively
hot waitqueue.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

liubo

2010-Oct-27 00:41 UTC

head link

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

On 10/27/2010 01:06 AM, Chris Mason wrote:> On Tue, Oct 26, 2010 at 09:36:26AM -0700, Sage Weil wrote:
>> On Tue, 26 Oct 2010, liubo wrote:
>>> Since wake_up() itself provides a implied wmb, and a wq active
check,
>>> it is better to drop "if (wq)" in
__btrfs_end_transaction().
>> I see.  It could also be
>>
>>         smb_mb();
>>         if (wq)
>>                 wake_up();
>>
>> but just calling wake_up() unconditionally is simpler, and fewer
barriers
>> in the wake_up case.  I''m not attached to the if (wq); I just
kept it
>> because it was there already.
> 
> wake_up() provides an implied barrier because it takes the lock.  I
> usually do the smp_mb() + if (wq) dance when I''m working on a
relatively
> hot waitqueue.
Anyway, it is your style.
:) 

thanks,
liubo
> 
> -chris
> 
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Possibly Parallel Threads

Search for more seemingly similar threads

Btrfs devel - Oct 2010 - [PATCH 0/6] Btrfs commit fixes, async subvol operations

[PATCH 0/6] Btrfs commit fixes, async subvol operations

[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

[PATCH 2/6] Btrfs: async transaction commit

[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls

[PATCH 4/6] Btrfs: add SNAP_CREATE_ASYNC ioctl

[PATCH 5/6] Btrfs: make SNAP_DESTROY async

[PATCH 6/6] Btrfs: allow subvol deletion by owner

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

Re: [PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

Possibly Parallel Threads