thr3ads.net - Btrfs devel - [RFC] big fat transaction ioctl [Nov 2009]

Sage Weil

2009-Nov-10 20:12 UTC

[RFC] big fat transaction ioctl

Hi all,

This is an alternative approach to atomic user transactions for btrfs.  
The old start/end ioctls suffer from some basic limitations, namely

 - We can't properly reserve space ahead of time to avoid ENOSPC part
way through the transaction, and
 - The process may die (seg fault, SIGKILL) part way through the 
transaction.  Currently when that happens the partial transaction will 
commit.

This patch implements an ioctl that lets the application completely 
specify the entire transaction in a single syscall.  If the process gets 
killed or seg faults part way through, the entire transaction will still 
complete.

The goal is to atomically commit updates to multiple files, xattrs, 
directories.  But this is still a file system: we don't get rollback if
things go wrong.  Instead, do what we can up front to make sure things 
will work out.  And if things do go wrong, optionally prevent a partial 
result from reaching the disk.
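
As a rough illustration of the interface described above, here is a minimal
userspace sketch that fills in a three-op transaction (open, pwrite, close).
The struct layouts and constants are copied from the fs/btrfs/ioctl.h hunk
in this patch and are not in any released kernel header;
build_write_txn() is a hypothetical helper name, and the choice of failure
flags is only one plausible policy.

```c
#include <stdint.h>
#include <string.h>
#include <fcntl.h>

/* Copied from the proposed fs/btrfs/ioctl.h in this RFC (assumption:
 * only usable on a kernel that carries the patch). */
#define BTRFS_IOC_UT_OP_OPEN             1  /* path, flags, mode, fd */
#define BTRFS_IOC_UT_OP_CLOSE            2  /* fd */
#define BTRFS_IOC_UT_OP_PWRITE           3  /* fd, data, length, offset */
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE  (1 << 1)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT  (1 << 3)

struct btrfs_ioctl_usertrans_op {
	uint64_t op;
	int64_t  args[5];
	int64_t  rval;
	uint64_t flags;
};

struct btrfs_ioctl_usertrans {
	uint64_t num_ops;       /* in: # ops */
	uint64_t ops_ptr;       /* in: usertrans_op array */
	uint64_t num_fds;       /* in: size of fd table (max fd + 1) */
	uint64_t data_bytes, metadata_ops; /* in: for space reservation */
	uint64_t flags;         /* in: flags */
	uint64_t ops_completed; /* out: # ops completed */
};

/* Hypothetical helper: describe "create path and write len bytes of buf". */
static void build_write_txn(struct btrfs_ioctl_usertrans *ut,
			    struct btrfs_ioctl_usertrans_op ops[3],
			    const char *path, const char *buf, size_t len)
{
	memset(ops, 0, 3 * sizeof(ops[0]));

	/* op 0: open; the resulting fd is stored in slot 0 of the fd table */
	ops[0].op = BTRFS_IOC_UT_OP_OPEN;
	ops[0].args[0] = (int64_t)(uintptr_t)path;
	ops[0].args[1] = O_WRONLY | O_CREAT;
	ops[0].args[2] = 0644;
	ops[0].args[3] = 0;                       /* fd table index */
	ops[0].rval = 0;
	ops[0].flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT;  /* fail if ret < 0 */

	/* op 1: pwrite through slot 0; anything but a full write fails */
	ops[1].op = BTRFS_IOC_UT_OP_PWRITE;
	ops[1].args[0] = 0;
	ops[1].args[1] = (int64_t)(uintptr_t)buf;
	ops[1].args[2] = (int64_t)len;
	ops[1].args[3] = 0;                       /* file offset */
	ops[1].rval = (int64_t)len;
	ops[1].flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE;

	/* op 2: close slot 0 */
	ops[2].op = BTRFS_IOC_UT_OP_CLOSE;
	ops[2].args[0] = 0;
	ops[2].rval = 0;
	ops[2].flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE;

	memset(ut, 0, sizeof(*ut));
	ut->num_ops = 3;
	ut->ops_ptr = (uint64_t)(uintptr_t)ops;
	ut->num_fds = 1;
	ut->data_bytes = len;
	ut->metadata_ops = 3;
}
```

On a kernel carrying the patch, the filled-in struct would then be passed
to ioctl(fd, BTRFS_IOC_USERTRANS, &ut) on a file in the target file system;
the sketch stops short of that call.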

A few things:

 - The implementation just exports the sys_* calls it needs (a popular 
move, no doubt :).  I've looked at using the corresponding vfs_*
instructions instead, and keeping a table of struct file *'s instead of
fd's to avoid these exports, but this requires a large amount of
duplication of semi-boilerplate path lookup, security_path_* hooks, and
similar code from fs/namei.c and elsewhere.  If we want to go that
route, there are some advantages, the main one being that we can verify
that every dentry/inode we operate on belongs to the same fs.  But the
code will be more complex... I'm not sure if I should pursue that just
yet.

 - The application gets to define what defines a failure for each 
individual op based on its return value.

 - If the transaction fails, the process can instruct the fs to wedge 
itself so that a partial result does not commit.  This isn't a
particularly elegant approach, but a wedged fs may be preferable to a
partial transaction commit.  (Alternatively, a failure could branch/jump to
another point in the transaction op vector to do some cleanup and/or an 
explicit WEDGE op to accomplish the same thing?)

- This still uses the existing ioctl start transaction call.  Depending on 
how Josef's ENOSPC journal_info stuff works out, I should be able to avoid
the current global open_ioctl_trans counter for a cleaner interaction with
the btrfs transaction code.

- The data space reservation is still missing.  I need a way to 
find which space_info will be used, and pin it for the duration 
of the entire transaction.

- The metadata reservation is a worst case bound.  It could be less 
conservative, but currently each op is pulled out of the user address 
space individually so we'd either need two passes, a big kmalloc, or
further trust the app to get the value right.  (Same goes for the data
size, actually, although that's easier to get correct.)
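
To make the per-op failure rule in the list above concrete, the check the
patch performs after each op can be restated as a standalone predicate.
This is only a sketch: the flag values are copied from the proposed
fs/btrfs/ioctl.h, while ut_op_failed() is a hypothetical name that does not
appear in the patch.

```c
#include <stdint.h>
#include <stdbool.h>

/* Flag values as proposed in this RFC's fs/btrfs/ioctl.h. */
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE   (1 << 1)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ   (1 << 2)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT   (1 << 3)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT   (1 << 4)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE  (1 << 5)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE  (1 << 6)

/*
 * Hypothetical helper: returns true if an op that returned `ret` counts
 * as a failure, given the application-supplied comparison value `rval`
 * and the FAIL_ON_* bits in `flags`.  With no bits set, nothing fails.
 */
static bool ut_op_failed(uint64_t flags, int64_t ret, int64_t rval)
{
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) && ret != rval)
		return true;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) && ret == rval)
		return true;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) && ret < rval)
		return true;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) && ret > rval)
		return true;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) && ret <= rval)
		return true;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) && ret >= rval)
		return true;
	return false;
}
```

For example, an open op would typically use FAIL_ON_LT with rval 0 (any
negative errno fails), while a pwrite would use FAIL_ON_NE with rval set to
the requested length so that a short write also aborts the transaction.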

Thoughts on this?

Thanks-
sage


Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c |  187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   49 ++++++++++++++
 fs/namei.c       |    3 +
 fs/open.c        |    2 +
 fs/read_write.c  |    2 +
 fs/xattr.c       |    2 +
 6 files changed, 245 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 136c5ed..4269616 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -37,6 +37,7 @@
 #include <linux/compat.h>
 #include <linux/bit_spinlock.h>
 #include <linux/security.h>
+#include <linux/syscalls.h>
 #include <linux/xattr.h>
 #include <linux/vmalloc.h>
 #include "compat.h"
@@ -1303,6 +1304,190 @@ long btrfs_ioctl_trans_end(struct file *file)
 	return 0;
 }
 
+/*
+ * return number of successfully completed ops via @ops_completed
+ * (where success/failure is defined by the _FAIL_* flags).
+ */
+static long do_usertrans(struct btrfs_root *root,
+			 struct btrfs_ioctl_usertrans *ut,
+			 u64 *ops_completed)
+{
+	int i;
+	int *fds;
+	int err;
+	struct file *file;
+	struct btrfs_ioctl_usertrans_op *ops = (void *)ut->ops_ptr;
+	int fd1, fd2;
+
+	fds = kcalloc(ut->num_fds, sizeof(int), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	for (i = 0; i < ut->num_ops; i++) {
+		struct btrfs_ioctl_usertrans_op op;
+		int ret;
+
+		err = -EFAULT;
+		if (copy_from_user(&op, &ops[i], sizeof(op)))
+			goto out;
+
+		/* lookup fd args? */
+		err = -EINVAL;
+		switch (op.op) {
+		case BTRFS_IOC_UT_OP_CLONERANGE:
+			if (op.args[1] < 0 || op.args[1] >= ut->num_fds)
+				goto out;
+			fd2 = fds[op.args[1]];
+			/* fall through: args[0] is also an fd table index */
+		case BTRFS_IOC_UT_OP_CLOSE:
+		case BTRFS_IOC_UT_OP_PWRITE:
+			if (op.args[0] < 0 || op.args[0] >= ut->num_fds)
+				goto out;
+			fd1 = fds[op.args[0]];
+		}
+
+		/* do op */
+		switch (op.op) {
+		case BTRFS_IOC_UT_OP_OPEN:
+			ret = -EINVAL;
+			if (op.args[3] < 0 || op.args[3] >= ut->num_fds)
+				goto out;
+			ret = sys_open((const char __user *)op.args[0],
+				       op.args[1], op.args[2]);
+			fds[op.args[3]] = ret;
+			break;
+		case BTRFS_IOC_UT_OP_CLOSE:
+			ret = sys_close(fd1);
+			break;
+		case BTRFS_IOC_UT_OP_PWRITE:
+			ret = sys_pwrite64(fd1, (const char __user *)op.args[1],
+					   op.args[2], op.args[3]);
+			break;
+		case BTRFS_IOC_UT_OP_UNLINK:
+			ret = sys_unlink((const char __user *)op.args[0]);
+			break;
+		case BTRFS_IOC_UT_OP_MKDIR:
+			ret = sys_mkdir((const char __user *)op.args[0],
+				op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_RMDIR:
+			ret = sys_rmdir((const char __user *)op.args[0]);
+			break;
+		case BTRFS_IOC_UT_OP_TRUNCATE:
+			ret = sys_truncate((const char __user *)op.args[0],
+					   op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_SETXATTR:
+			ret = sys_setxattr((char __user *)op.args[0],
+					   (char __user *)op.args[1],
+					   (void __user *)op.args[2],
+					   op.args[3], op.args[4]);
+			break;
+		case BTRFS_IOC_UT_OP_REMOVEXATTR:
+			ret = sys_removexattr((char __user *)op.args[0],
+					      (char __user *)op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_CLONERANGE:
+			ret = -EBADF;
+			file = fget(fd1);
+			if (file) {
+				ret = btrfs_ioctl_clone(file, fd2,
+							op.args[2], op.args[3],
+							op.args[4]);
+				fput(file);
+			}
+			break;
+		}
+		pr_debug(" ut %d/%d op %d args %llx %llx %llx %llx %llx = %d\n",
+			 i, (int)ut->num_ops, (int)op.op, op.args[0],
+			 op.args[1], op.args[2], op.args[3], op.args[4], ret);
+
+		put_user(ret, &ops[i].rval);
+
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) &&
+		    ret != op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) &&
+		    ret == op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) &&
+		    ret < op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) &&
+		    ret > op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) &&
+		    ret <= op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) &&
+		    ret >= op.rval)
+			goto out;
+	}
+	err = 0;
+out:
+	*ops_completed = i;
+	kfree(fds);
+	return err;
+}
+
+long btrfs_ioctl_usertrans(struct file *file, void __user *arg)
+{
+	struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_ioctl_usertrans ut, *orig_ut = arg;
+	u64 ops_completed = 0;
+	int ret;
+
+	ret = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&ut, orig_ut, sizeof(ut)))
+		goto out;
+
+	ret = mnt_want_write(file->f_path.mnt);
+	if (ret)
+		goto out;
+
+	ret = btrfs_reserve_metadata_space(root, 5*ut.num_ops);
+	if (ret)
+		goto out_drop_write;
+
+	mutex_lock(&root->fs_info->trans_mutex);
+	root->fs_info->open_ioctl_trans++;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	ret = -ENOMEM;
+	trans = btrfs_start_ioctl_transaction(root, 0);
+	if (!trans)
+		goto out_drop;
+
+	ret = do_usertrans(root, &ut, &ops_completed);
+	put_user(ops_completed, &orig_ut->ops_completed);
+
+	if (ret < 0 && (ut.flags & BTRFS_IOC_UT_FLAG_WEDGEONFAIL))
+		pr_err("btrfs: usertrans failed, wedging to avoid partial "
+		       " commit\n");
+	else
+		btrfs_end_transaction(trans, root);
+
+out_drop:
+	mutex_lock(&root->fs_info->trans_mutex);
+	root->fs_info->open_ioctl_trans--;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	btrfs_unreserve_metadata_space(root, 5*ut.num_ops);
+out_drop_write:
+	mnt_drop_write(file->f_path.mnt);
+out:
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -1343,6 +1528,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SYNC:
 		btrfs_sync_fs(file->f_dentry->d_sb, 1);
 		return 0;
+	case BTRFS_IOC_USERTRANS:
+		return btrfs_ioctl_usertrans(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index bc49914..f94e293 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -67,4 +67,53 @@ struct btrfs_ioctl_clone_range_args {
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
 				struct btrfs_ioctl_vol_args)
+
+/* usertrans ops */
+/* the 'fd' values are _indices_ into a temporary fd table, see num_fds below */
+#define BTRFS_IOC_UT_OP_OPEN         1  /* path, flags, mode, fd */
+#define BTRFS_IOC_UT_OP_CLOSE        2  /* fd */
+#define BTRFS_IOC_UT_OP_PWRITE       3  /* fd, data, length, offset */
+#define BTRFS_IOC_UT_OP_UNLINK       4  /* path */
+#define BTRFS_IOC_UT_OP_LINK         5  /* oldpath, newpath */
+#define BTRFS_IOC_UT_OP_MKDIR        6  /* path, mode */
+#define BTRFS_IOC_UT_OP_RMDIR        7  /* path */
+#define BTRFS_IOC_UT_OP_TRUNCATE     8  /* path, size */
+#define BTRFS_IOC_UT_OP_SETXATTR     9  /* path, name, data, len, flags */
+#define BTRFS_IOC_UT_OP_REMOVEXATTR 10  /* path, name */
+#define BTRFS_IOC_UT_OP_CLONERANGE  11  /* dst fd, src fd, off, len, dst off */
+
+/* define what 'failure' entails for each op based on return value */
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE    (1<< 1)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ    (1<< 2)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT    (1<< 3)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT    (1<< 4)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE   (1<< 5)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE   (1<< 6)
+
+struct btrfs_ioctl_usertrans_op {
+	__u64 op;
+	__s64 args[5];
+	__s64 rval;
+	__u64 flags;
+};
+
+/*
+ * If an op fails and we cannot complete the transaction, we may want
+ * to lock up the file system (requiring a reboot) to prevent a
+ * partial result from committing.
+ */
+#define BTRFS_IOC_UT_FLAG_WEDGEONFAIL (1<<13)
+
+struct btrfs_ioctl_usertrans {
+	__u64 num_ops;                  /* in: # ops */
+	__u64 ops_ptr;                  /* in: usertrans_op array */
+	__u64 num_fds;	                /* in: size of fd table (max fd + 1) */
+	__u64 data_bytes, metadata_ops; /* in: for space reservation */
+	__u64 flags;                    /* in: flags */
+	__u64 ops_completed;            /* out: # ops completed */
+};
+
+#define BTRFS_IOC_USERTRANS  _IOW(BTRFS_IOCTL_MAGIC, 16,	\
+				  struct btrfs_ioctl_usertrans)
+
 #endif
diff --git a/fs/namei.c b/fs/namei.c
index d11f404..4d53225 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2148,6 +2148,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
 {
 	return sys_mkdirat(AT_FDCWD, pathname, mode);
 }
+EXPORT_SYMBOL(sys_mkdir);
 
 /*
  * We try to drop the dentry early: we should have
@@ -2262,6 +2263,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
 {
 	return do_rmdir(AT_FDCWD, pathname);
 }
+EXPORT_SYMBOL(sys_rmdir);
 
 int vfs_unlink(struct inode *dir, struct dentry *dentry)
 {
@@ -2369,6 +2371,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
 {
 	return do_unlinkat(AT_FDCWD, pathname);
 }
+EXPORT_SYMBOL(sys_unlink);
 
 int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 {
diff --git a/fs/open.c b/fs/open.c
index 4f01e06..15eddfc 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -294,6 +294,7 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, long, length)
 {
 	return do_sys_truncate(path, length);
 }
+EXPORT_SYMBOL(sys_truncate);
 
 static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 {
@@ -1062,6 +1063,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
 	asmlinkage_protect(3, ret, filename, flags, mode);
 	return ret;
 }
+EXPORT_SYMBOL(sys_open);
 
 SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
 		int, mode)
diff --git a/fs/read_write.c b/fs/read_write.c
index 3ac2898..75e9f60 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -453,6 +453,8 @@ SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
 
 	return ret;
 }
+EXPORT_SYMBOL(sys_pwrite64);
+
 #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
 asmlinkage long SyS_pwrite64(long fd, long buf, long count, loff_t pos)
 {
diff --git a/fs/xattr.c b/fs/xattr.c
index 6d4f6d3..488c889 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -294,6 +294,7 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
 	path_put(&path);
 	return error;
 }
+EXPORT_SYMBOL(sys_setxattr);
 
 SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 		const char __user *, name, const void __user *, value,
@@ -523,6 +524,7 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
 	path_put(&path);
 	return error;
 }
+EXPORT_SYMBOL(sys_removexattr);
 
 SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
 		const char __user *, name)
-- 
1.5.6.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrey Kuzmin

2009-Nov-10 20:44 UTC

Re: [RFC] big fat transaction ioctl

On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> Hi all,
>
> This is an alternative approach to atomic user transactions for btrfs.
> The old start/end ioctls suffer from some basic limitations, namely
>
>  - We can't properly reserve space ahead of time to avoid ENOSPC part
> way through the transaction, and
>  - The process may die (seg fault, SIGKILL) part way through the
> transaction.  Currently when that happens the partial transaction will
> commit.
>
> This patch implements an ioctl that lets the application completely
> specify the entire transaction in a single syscall.  If the process gets
> killed or seg faults part way through, the entire transaction will still
> complete.
>
> The goal is to atomically commit updates to multiple files, xattrs,
> directories.  But this is still a file system: we don't get rollback if
> things go wrong.  Instead, do what we can up front to make sure things
> will work out.  And if things do go wrong, optionally prevent a partial
> result from reaching the disk.

Why not snapshot respective root (doesn't work if transaction spans
multiple file-systems, but this doesn't look like a real-world
limitation), run txn against that snapshot and rollback on failure
instead? Snapshots are writable, cheap, and this looks like a real
transaction abort mechanism.

Regards,
Andrey

Sage Weil

2009-Nov-10 22:13 UTC

Re: [RFC] big fat transaction ioctl

On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > Hi all,
> >
> > This is an alternative approach to atomic user transactions for btrfs.
> > The old start/end ioctls suffer from some basic limitations, namely
> >
> >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > way through the transaction, and
> >  - The process may die (seg fault, SIGKILL) part way through the
> > transaction.  Currently when that happens the partial transaction will
> > commit.
> >
> > This patch implements an ioctl that lets the application completely
> > specify the entire transaction in a single syscall.  If the process gets
> > killed or seg faults part way through, the entire transaction will still
> > complete.
> >
> > The goal is to atomically commit updates to multiple files, xattrs,
> > directories.  But this is still a file system: we don't get rollback if
> > things go wrong.  Instead, do what we can up front to make sure things
> > will work out.  And if things do go wrong, optionally prevent a partial
> > result from reaching the disk.
>
> Why not snapshot respective root (doesn't work if transaction spans
> multiple file-systems, but this doesn't look like a real-world
> limitation), run txn against that snapshot and rollback on failure
> instead? Snapshots are writable, cheap, and this looks like a real
> transaction abort mechanism.

Good question.  :)

I hadn't looked into this before, but I think the snapshots could be used
to achieve both atomicity and rollback.  If userspace uses an rw mutex to
quiesce writes, it can make sure all transactions complete before creating
a snapshot (commit).  The problem with this currently is the create
snapshot ioctl is relatively slow... it calls commit_transaction, which
blocks until everything reaches disk.  I think to perform well this
approach would need a hook to start a commit and then return as soon as it
can guarantee that any subsequent operation's start_transaction can't join
in that commit.

This may be a better way to go about this, though.  Does that sound
reasonable, Chris?

sage


Jeremy Fitzhardinge

2009-Nov-11 00:49 UTC

Re: [RFC] big fat transaction ioctl

On 11/10/09 14:13, Sage Weil wrote:
> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>
>> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>>
>>> Hi all,
>>>
>>> This is an alternative approach to atomic user transactions for btrfs.
>>> The old start/end ioctls suffer from some basic limitations, namely
>>>
>>>  - We can't properly reserve space ahead of time to avoid ENOSPC part
>>> way through the transaction, and
>>>  - The process may die (seg fault, SIGKILL) part way through the
>>> transaction.  Currently when that happens the partial transaction will
>>> commit.
>>>
>>> This patch implements an ioctl that lets the application completely
>>> specify the entire transaction in a single syscall.  If the process gets
>>> killed or seg faults part way through, the entire transaction will still
>>> complete.
>>>
>>> The goal is to atomically commit updates to multiple files, xattrs,
>>> directories.  But this is still a file system: we don't get rollback if
>>> things go wrong.  Instead, do what we can up front to make sure things
>>> will work out.  And if things do go wrong, optionally prevent a partial
>>> result from reaching the disk.
>>>
>> Why not snapshot respective root (doesn't work if transaction spans
>> multiple file-systems, but this doesn't look like a real-world
>> limitation), run txn against that snapshot and rollback on failure
>> instead? Snapshots are writable, cheap, and this looks like a real
>> transaction abort mechanism.
>>
> Good question.  :)
>
> I hadn't looked into this before, but I think the snapshots could be used
> to achieve both atomicity and rollback.  If userspace uses an rw mutex to
> quiesce writes, it can make sure all transactions complete before creating
> a snapshot (commit).  The problem with this currently is the create
> snapshot ioctl is relatively slow... it calls commit_transaction, which
> blocks until everything reaches disk.  I think to perform well this
> approach would need a hook to start a commit and then return as soon as it
> can guarantee that any subsequent operation's start_transaction can't join
> in that commit.
>
> This may be a better way to go about this, though.  Does that sound
> reasonable, Chris?
>
If snapshots only capture what's currently physically on disk, then it
means that the transactions will be fairly heavyweight in requiring
everything to be physically synced.  That may be what some apps want
anyway, but I can certainly imagine apps wanting transaction semantics
without having fsync-level durability requirements.

    J

Sage Weil

2009-Nov-11 05:15 UTC

Re: [RFC] big fat transaction ioctl

On Tue, 10 Nov 2009, Jeremy Fitzhardinge wrote:
> On 11/10/09 14:13, Sage Weil wrote:
> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> >
> >> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> >>
> >>> Hi all,
> >>>
> >>> This is an alternative approach to atomic user transactions for btrfs.
> >>> The old start/end ioctls suffer from some basic limitations, namely
> >>>
> >>>  - We can't properly reserve space ahead of time to avoid ENOSPC part
> >>> way through the transaction, and
> >>>  - The process may die (seg fault, SIGKILL) part way through the
> >>> transaction.  Currently when that happens the partial transaction will
> >>> commit.
> >>>
> >>> This patch implements an ioctl that lets the application completely
> >>> specify the entire transaction in a single syscall.  If the process gets
> >>> killed or seg faults part way through, the entire transaction will still
> >>> complete.
> >>>
> >>> The goal is to atomically commit updates to multiple files, xattrs,
> >>> directories.  But this is still a file system: we don't get rollback if
> >>> things go wrong.  Instead, do what we can up front to make sure things
> >>> will work out.  And if things do go wrong, optionally prevent a partial
> >>> result from reaching the disk.
> >>>
> >> Why not snapshot respective root (doesn't work if transaction spans
> >> multiple file-systems, but this doesn't look like a real-world
> >> limitation), run txn against that snapshot and rollback on failure
> >> instead? Snapshots are writable, cheap, and this looks like a real
> >> transaction abort mechanism.
> >>
> > Good question.  :)
> >
> > I hadn't looked into this before, but I think the snapshots could be used
> > to achieve both atomicity and rollback.  If userspace uses an rw mutex to
> > quiesce writes, it can make sure all transactions complete before creating
> > a snapshot (commit).  The problem with this currently is the create
> > snapshot ioctl is relatively slow... it calls commit_transaction, which
> > blocks until everything reaches disk.  I think to perform well this
> > approach would need a hook to start a commit and then return as soon as it
> > can guarantee that any subsequent operation's start_transaction can't join
> > in that commit.
> >
> > This may be a better way to go about this, though.  Does that sound
> > reasonable, Chris?
> >
>
> If snapshots only capture what's currently physically on disk, then it
> means that the transactions will be fairly heavyweight in requiring
> everything to be physically synced.  That may be what some apps want
> anyway, but I can certainly imagine apps wanting transaction semantics
> without having fsync-level durability requirements.
Just to be clear, the transactions I'm talking about _only_ control the
way operations are grouped when they commit to disk.  They do not in any
way affect the view of the file system that another concurrently running
process might see.  It's the application's responsibility to deal with
the sort of transaction concurrency you're talking about.

sage

Chris Mason

2009-Nov-11 14:54 UTC

Re: [RFC] big fat transaction ioctl

On Tue, Nov 10, 2009 at 12:12:14PM -0800, Sage Weil wrote:
> Hi all,
>
> This is an alternative approach to atomic user transactions for btrfs.
> The old start/end ioctls suffer from some basic limitations, namely
>
>  - We can't properly reserve space ahead of time to avoid ENOSPC part
> way through the transaction, and
>  - The process may die (seg fault, SIGKILL) part way through the
> transaction.  Currently when that happens the partial transaction will
> commit.
I like this much more than providing a journal start/stop to userland.
If we can get Christoph to ack the exports we can work on the interface
in general.

-chris
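
For anyone looking at the interface, here is a rough userspace sketch of how
an op vector for the proposed ioctl might be filled in.  The op numbers,
flag bits and struct layout are copied from the RFC patch (with __u64/__s64
spelled as uint64_t/int64_t); the ioctl is only a proposal, so nothing here
calls into the kernel.  ut_fill_mkdir(), ut_fill_pwrite() and ut_op_failed()
are hypothetical helpers, the last one mirroring the per-op failure test
that do_usertrans() applies to each return value.

```c
#include <stdint.h>
#include <string.h>

/* Copied from the RFC patch; not a merged kernel ABI. */
#define BTRFS_IOC_UT_OP_PWRITE            3	/* fd, data, length, offset */
#define BTRFS_IOC_UT_OP_MKDIR             6	/* path, mode */

#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE   (1 << 1)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ   (1 << 2)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT   (1 << 3)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT   (1 << 4)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE  (1 << 5)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE  (1 << 6)

struct btrfs_ioctl_usertrans_op {
	uint64_t op;
	int64_t  args[5];
	int64_t  rval;		/* in: value to compare against; out: result */
	uint64_t flags;
};

/* Hypothetical helper: mkdir must return exactly 0, anything else fails. */
static void ut_fill_mkdir(struct btrfs_ioctl_usertrans_op *op,
			  const char *path, int mode)
{
	memset(op, 0, sizeof(*op));
	op->op = BTRFS_IOC_UT_OP_MKDIR;
	op->args[0] = (int64_t)(uintptr_t)path;
	op->args[1] = mode;
	op->rval = 0;
	op->flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE;
}

/* Hypothetical helper: a short or failed write counts as failure. */
static void ut_fill_pwrite(struct btrfs_ioctl_usertrans_op *op, int fd_index,
			   const void *data, int64_t len, int64_t off)
{
	memset(op, 0, sizeof(*op));
	op->op = BTRFS_IOC_UT_OP_PWRITE;
	op->args[0] = fd_index;		/* index into the temporary fd table */
	op->args[1] = (int64_t)(uintptr_t)data;
	op->args[2] = len;
	op->args[3] = off;
	op->rval = len;
	op->flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE;
}

/* Userspace mirror of the failure test: 1 if ret counts as a failure. */
static int ut_op_failed(const struct btrfs_ioctl_usertrans_op *op, int64_t ret)
{
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) && ret != op->rval)
		return 1;
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) && ret == op->rval)
		return 1;
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) && ret < op->rval)
		return 1;
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) && ret > op->rval)
		return 1;
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) && ret <= op->rval)
		return 1;
	if ((op->flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) && ret >= op->rval)
		return 1;
	return 0;
}
```

An array of these ops, plus num_ops, num_fds and the flags word, is what the
application would hand to the ioctl in one call.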

Chris Mason

2009-Nov-11 15:03 UTC

Re: [RFC] big fat transaction ioctl

On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>
> > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > > Hi all,
> > >
> > > This is an alternative approach to atomic user transactions for btrfs.
> > > The old start/end ioctls suffer from some basic limitations, namely
> > >
> > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > > way through the transaction, and
> > >  - The process may die (seg fault, SIGKILL) part way through the
> > > transaction.  Currently when that happens the partial transaction will
> > > commit.
> > >
> > > This patch implements an ioctl that lets the application completely
> > > specify the entire transaction in a single syscall.  If the process gets
> > > killed or seg faults part way through, the entire transaction will still
> > > complete.
> > >
> > > The goal is to atomically commit updates to multiple files, xattrs,
> > > directories.  But this is still a file system: we don't get rollback if
> > > things go wrong.  Instead, do what we can up front to make sure things
> > > will work out.  And if things do go wrong, optionally prevent a partial
> > > result from reaching the disk.
> >
> > Why not snapshot respective root (doesn't work if transaction spans
> > multiple file-systems, but this doesn't look like a real-world
> > limitation), run txn against that snapshot and rollback on failure
> > instead? Snapshots are writable, cheap, and this looks like a real
> > transaction abort mechanism.
>
> Good question.  :)
>
> I hadn't looked into this before, but I think the snapshots could be used
> to achieve both atomicity and rollback.  If userspace uses an rw mutex to
> quiesce writes, it can make sure all transactions complete before creating
> a snapshot (commit).  The problem with this currently is the create
> snapshot ioctl is relatively slow... it calls commit_transaction, which
> blocks until everything reaches disk.  I think to perform well this
> approach would need a hook to start a commit and then return as soon as it
> can guarantee that any subsequent operation's start_transaction can't join
> in that commit.
>
> This may be a better way to go about this, though.  Does that sound
> reasonable, Chris?
Yes, we could do this, but I don't think it will perform very well
compared to your multi-operation ioctl.  It really does depend on how
often you need to do atomic ops (my guess is very).

Honestly you'll get better performance with a simple write-ahead log
from userland:

step1: write redo log somewhere in the FS, with enough information to
bring all the objects you're about to touch to a consistent state.
step2: fsync the log
step3: do your operations
step4: append a record to the undo log that invalidates the last log
op, or just truncate it to zero.
step5: fsync the log.

The big advantage of the log is that you won't be tied to btrfs, but
it's two fsyncs where the big transaction framework does none.  This
should allow you to turn on the fast fsync log again, but I think the
multi-operation ioctl would do that as well.

-chris


Andrey Kuzmin

2009-Nov-11 15:41 UTC

Re: [RFC] big fat transaction ioctl

On Wed, Nov 11, 2009 at 6:03 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>>
>> > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>> > > Hi all,
>> > >
>> > > This is an alternative approach to atomic user transactions for btrfs.
>> > > The old start/end ioctls suffer from some basic limitations, namely
>> > >
>> > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
>> > > way through the transaction, and
>> > >  - The process may die (seg fault, SIGKILL) part way through the
>> > > transaction.  Currently when that happens the partial transaction will
>> > > commit.
>> > >
>> > > This patch implements an ioctl that lets the application completely
>> > > specify the entire transaction in a single syscall.  If the process gets
>> > > killed or seg faults part way through, the entire transaction will still
>> > > complete.
>> > >
>> > > The goal is to atomically commit updates to multiple files, xattrs,
>> > > directories.  But this is still a file system: we don't get rollback if
>> > > things go wrong.  Instead, do what we can up front to make sure things
>> > > will work out.  And if things do go wrong, optionally prevent a partial
>> > > result from reaching the disk.
>> >
>> > Why not snapshot respective root (doesn't work if transaction spans
>> > multiple file-systems, but this doesn't look like a real-world
>> > limitation), run txn against that snapshot and rollback on failure
>> > instead? Snapshots are writable, cheap, and this looks like a real
>> > transaction abort mechanism.
>>
>> Good question.  :)
>>
>> I hadn't looked into this before, but I think the snapshots could be used
>> to achieve both atomicity and rollback.  If userspace uses an rw mutex to
>> quiesce writes, it can make sure all transactions complete before creating
>> a snapshot (commit).  The problem with this currently is the create
>> snapshot ioctl is relatively slow... it calls commit_transaction, which
>> blocks until everything reaches disk.  I think to perform well this
>> approach would need a hook to start a commit and then return as soon as it
>> can guarantee that any subsequent operation's start_transaction can't join
>> in that commit.
>>
>> This may be a better way to go about this, though.  Does that sound
>> reasonable, Chris?
>
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl.  It really does depend on how
> often you need to do atomic ops (my guess is very).
>
> Honestly you'll get better performance with a simple write-ahead log
> from userland:
Write-ahead logging is necessary anyway if the aim is to provide
transactional semantics to an application. But, at the same time, w/o
snapshot there is no synchronization between the log and file-system
state.

Regards,
Andrey
>
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
>
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none.  This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
>
> -chris
>
>--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2009-Nov-11 15:55 UTC

Re: [RFC] big fat transaction ioctl

On Wed, Nov 11, 2009 at 06:41:06PM +0300, Andrey Kuzmin wrote:
> >> I hadn't looked into this before, but I think the snapshots could be
> >> used to achieve both atomicity and rollback.  If userspace uses an rw
> >> mutex to quiesce writes, it can make sure all transactions complete
> >> before creating a snapshot (commit).  The problem with this currently
> >> is the create snapshot ioctl is relatively slow... it calls
> >> commit_transaction, which blocks until everything reaches disk.  I
> >> think to perform well this approach would need a hook to start a
> >> commit and then return as soon as it can guarantee that any
> >> subsequent operation's start_transaction can't join in that commit.
> >>
> >> This may be a better way to go about this, though.  Does that sound
> >> reasonable, Chris?
> >
> > Yes, we could do this, but I don't think it will perform very well
> > compared to your multi-operation ioctl.  It really does depend on how
> > often you need to do atomic ops (my guess is very).
> >
> > Honestly you'll get better performance with a simple write-ahead log
> > from userland:
>
> Write-ahead logging is necessary anyway if the aim is to provide
> transactional semantics to an application.

Sage's big fat ioctl does provide the subset of transactional semantics
that ceph (and many other apps) require.  In this case, they just want
to know that a given set of operations will happen together.

> But, at the same time, w/o
> snapshot there is no synchronization between the log and file-system
> state.

Synchronizing the log and the filesystem state happens when the
application starts up after the crash (either app crash or system
crash).  The application would be in charge of applying the log to its
own files to get the system into whatever state the app thinks is
consistent.

-chris
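
As a rough illustration of the five log steps quoted in this thread and
of the startup recovery described above, here is a minimal Python sketch.
The RedoLog class, the JSON record format, and the "rewrite the whole
file" operations are all assumptions made for the example; none of it
comes from btrfs or Ceph. The sketch also fsyncs each data file before
the log is cleared, a detail the steps leave open.

```python
import json
import os


class RedoLog:
    """Hypothetical single-file redo log (illustration only)."""

    def __init__(self, path):
        self.path = path

    def begin(self, ops):
        # step1 + step2: record the intended ops, then fsync the log.
        with open(self.path, "w") as f:
            json.dump(ops, f)
            f.flush()
            os.fsync(f.fileno())

    def end(self):
        # step4 + step5: truncate the log to zero, then fsync it.
        with open(self.path, "w") as f:
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # Ops of an unfinished transaction, or [] if there are none.
        try:
            with open(self.path) as f:
                text = f.read()
            return json.loads(text) if text else []
        except FileNotFoundError:
            return []
        except json.JSONDecodeError:
            # A torn record means step2 never finished, so step3 never ran.
            return []


def apply_ops(ops):
    # Each op rewrites a whole file, so replaying it is idempotent.
    for op in ops:
        with open(op["path"], "w") as f:
            f.write(op["data"])
            f.flush()
            os.fsync(f.fileno())


def transact(log, ops):
    log.begin(ops)
    apply_ops(ops)  # step3
    log.end()


def recover(log):
    # On startup: replay whatever the log still holds, then clear it.
    ops = log.pending()
    if ops:
        apply_ops(ops)
        log.end()
    return len(ops)
```

An application would call recover() once at startup, before accepting
new work, and transact() for every group of updates that must land
together. The ops have to be idempotent, because a crash during step3
leaves some of them applied and recovery replays all of them.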


Sage Weil

2009-Nov-11 17:19 UTC

Re: [RFC] big fat transaction ioctl

On Wed, 11 Nov 2009, Chris Mason wrote:
> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> >
> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > > > Hi all,
> > > >
> > > > This is an alternative approach to atomic user transactions for
> > > > btrfs.  The old start/end ioctls suffer from some basic
> > > > limitations, namely
> > > >
> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC
> > > > part way through the transaction, and
> > > >  - The process may die (seg fault, SIGKILL) part way through the
> > > > transaction.  Currently when that happens the partial transaction
> > > > will commit.
> > > >
> > > > This patch implements an ioctl that lets the application
> > > > completely specify the entire transaction in a single syscall.  If
> > > > the process gets killed or seg faults part way through, the entire
> > > > transaction will still complete.
> > > >
> > > > The goal is to atomically commit updates to multiple files,
> > > > xattrs, directories.  But this is still a file system: we don't
> > > > get rollback if things go wrong.  Instead, do what we can up front
> > > > to make sure things will work out.  And if things do go wrong,
> > > > optionally prevent a partial result from reaching the disk.
> > >
> > > Why not snapshot respective root (doesn't work if transaction spans
> > > multiple file-systems, but this doesn't look like a real-world
> > > limitation), run txn against that snapshot and rollback on failure
> > > instead? Snapshots are writable, cheap, and this looks like a real
> > > transaction abort mechanism.
> >
> > Good question.  :)
> >
> > I hadn't looked into this before, but I think the snapshots could be
> > used to achieve both atomicity and rollback.  If userspace uses an rw
> > mutex to quiesce writes, it can make sure all transactions complete
> > before creating a snapshot (commit).  The problem with this currently
> > is the create snapshot ioctl is relatively slow... it calls
> > commit_transaction, which blocks until everything reaches disk.  I
> > think to perform well this approach would need a hook to start a
> > commit and then return as soon as it can guarantee that any subsequent
> > operation's start_transaction can't join in that commit.
> >
> > This may be a better way to go about this, though.  Does that sound
> > reasonable, Chris?
>
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl.  It really does depend on how
> often you need to do atomic ops (my guess is very).

The thing is, I'm not sure using snaps is that different from what I'm
doing now.  Currently the ioctl transactions don't hit disk until each
full commit (flushoncommit, no fsync).  Unless the presence of a snapshot
adds additional overhead (to the commit, or to cleaning up the slightly
longer-living snapped roots), the difference would be that starting
transactions would need to be blocked by the application instead of
wait_current_trans in start_transaction, and (currently at least) they
would wait longer (the extra writes between blocked = 0 and
commit_done = 1 in commit_transaction).

The key, as now, is keeping the full fs syncs infrequent.  And, if 
possible, reducing the duration of the blocked == 1 period during 
commit_transaction.
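
To make the "blocked by the application" part concrete, here is a
minimal Python sketch of the rw-mutex idea: transactions take the gate
in shared mode, and a commit takes it exclusively, so it first stops new
transactions and then waits for the running ones to drain. The
TransactionGate class and its method names are hypothetical, and the
snapshot or sync call itself is left to the caller.

```python
import threading
from contextlib import contextmanager


class TransactionGate:
    """Hypothetical userspace gate: shared for txns, exclusive for commit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0          # transactions currently running
        self._committing = False  # True while a commit holds the gate

    @contextmanager
    def transaction(self):
        with self._cond:
            while self._committing:
                self._cond.wait()  # the application-level "blocked" period
            self._active += 1
        try:
            yield
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()

    @contextmanager
    def commit(self):
        with self._cond:
            while self._committing:
                self._cond.wait()
            self._committing = True  # stop new transactions from starting
            while self._active:
                self._cond.wait()    # let running transactions finish
        try:
            yield  # the caller creates the snapshot (or syncs) here
        finally:
            with self._cond:
                self._committing = False
                self._cond.notify_all()
```

The time spent inside commit() is the userspace counterpart of the
blocked == 1 period, so keeping the work done under it short matters
just as much here.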

> Honestly you'll get better performance with a simple write-ahead log
> from userland:

There actually is a log, but it's optional and not strictly write-ahead...
it's only used to reduce the commit latency:

1- apply operations to fs (grouped into atomic transactions)
2- (optionally) write and flush log entry
...repeat...
3- periodically sync the fs, then trim the log.  or sync early if a 
client explicitly requests it.

But

1- I don't want to make the log required.  Sometimes you're more concerned
about total throughput, not latency, and the log halves your write bw
unless you add more spindles.

2- I don't want it strictly write-ahead because (in the absence of atomic
ops) it means you have to wait for the log to sync before applying the ops
to the fs (to ensure the fs doesn't get a partial transaction ahead of the
log).  This marries atomicity with your schedule for durability, which
isn't necessarily what you want.  (e.g., Ceph makes a distinction between
serialized and committed ops, allowing limited sharing of data before it
hits disk.  That's the nice thing about this ioctl... it's pretty common
that atomicity is the only requirement.)

With the optional (write-behind?) log and transaction ioctls, IF you want
low latency commits, enable the log and ideally give it its own spindle,
and infrequently sync btrfs to get good layout and low overhead.

Unless you think I'm missing something with the snapshot approach, I can
give that a try and see how it does.  It requires explicit management of
the sync/commit schedule, but in my case at least I'm doing that already.
A transaction ioctl is simpler for userland and would be more generically
useful for other apps (particularly those who don't want to manage
commits), but will always have some small possibility of partial
failure/abort without rollback.

sage
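
The apply / optionally log / periodically sync scheme above can be
modelled with a small in-memory toy. This is only a sketch: ToyStore and
its fields are hypothetical, and the model assumes that a crash rolls
the fs back to its last full sync and that each transaction is applied
atomically.

```python
class ToyStore:
    """In-memory stand-in for the fs plus the optional log (hypothetical)."""

    def __init__(self, use_log=True):
        self.use_log = use_log
        self.working = {}    # fs state, including not-yet-synced transactions
        self.committed = {}  # fs state as of the last full sync
        self.log = []        # transactions logged since the last sync

    def apply(self, txn):
        # 1- apply the whole transaction to the fs atomically.
        self.working.update(txn)
        # 2- optionally write and flush a log entry afterwards.
        if self.use_log:
            self.log.append(dict(txn))

    def sync(self):
        # 3- full fs sync, after which the log can be trimmed.
        self.committed = dict(self.working)
        self.log.clear()

    def crash_and_recover(self):
        # The crash loses everything applied after the last sync...
        self.working = dict(self.committed)
        # ...and replaying the log brings the logged transactions back.
        for txn in self.log:
            self.working.update(txn)
```

With use_log=False a transaction only survives a crash once the next
sync() has run; with the log enabled it survives as soon as its entry is
flushed, which is the commit-latency benefit described above.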

>
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
>
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none.  This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
>
> -chris

Zach Brown

2009-Nov-11 18:22 UTC

Re: [RFC] big fat transaction ioctl

> I like this much more than providing a journal start/stop to userland.
> If we can get Christoph to ack the exports we can work on the interface
> in general.

I'll note, briefly, that it seems dangerous to call right into the sys_
functions instead of going through the architecture's syscall number
dispatching path.  Do you know if the syscalls you're calling have
compat wrappers on some architectures for some userspace abis?

With that out of the way, though, I'll get on to my bigger point.

This interface for specifying an array of syscalls to call looks a whole
lot like the work that fs/aio.c, syslets, and acall have all done.  The
flags for stopping processing of the array based on errors from the
syscalls are remarkably similar to Ingo's atom structs.

So maybe there's an opportunity for a generic syscall for processing
batches of syscalls.  Maybe you'll bracket some of them with btrfs
ioctls for flagging the task_struct as being in a btrfs transaction, but
maybe you'll also flag some for concurrent acall processing or nutty
syslet thread spawning if they block.

It'll probably take some work to be able to call syscall handlers from C
on all architectures, and we'd have to be really careful about the
semantics if we start mixing btrfs ioctls and async flags, but it just
might be worth it.

- z

Sage Weil

2009-Nov-11 22:22 UTC

Re: [RFC] big fat transaction ioctl

On Wed, 11 Nov 2009, Zach Brown wrote:
> > I like this much more than providing a journal start/stop to userland.
> > If we can get Christoph to ack the exports we can work on the
> > interface in general.
>
> I'll note, briefly, that it seems dangerous to call right into the sys_
> functions instead of going through the architecture's syscall number
> dispatching path.  Do you know if the syscalls you're calling have
> compat wrappers on some architectures for some userspace abis?
>
> With that out of the way, though, I'll get on to my bigger point.
>
> This interface for specifying an array of syscalls to call looks a whole
> lot like the work that fs/aio.c, syslets, and acall have all done.  The
> flags for stopping processing of the array based on errors from the
> syscalls are remarkably similar to Ingo's atom structs.

Yeah, I think both syslets (with atoms) and acall provide a much more
elegant interface than what I've described.  (I should have looked at them
more closely before; I didn't care about being asynchronous.)  The only
real requirement for the atomic user transactions is that the batch of
operations not be interrupted by SIGKILL or seg fault (at least for the
calls that are being used).

> So maybe there's an opportunity for a generic syscall for processing
> batches of syscalls.  Maybe you'll bracket some of them with btrfs
> ioctls for flagging the task_struct as being in a btrfs transaction, but
> maybe you'll also flag some for concurrent acall processing or nutty
> syslet thread spawning if they block.

Right.  In my case running things asynchronously isn't strictly necessary.
And it seems like the asynchrony and batching (in both syslets and acall)
are somewhat orthogonal.  Maybe the async part should really be completely
independent, and not, say, a flag?

	sys_batch(a vector/graph of ops)

and then

	sys_acall(a single call)
or	sys_syslet(a single call)	

Either Ingo's atoms or the simpler acall op vector would suffice in my
case.  Even current acall() would do; the thread overhead won't be
significant over the relatively slow fs transactions.

sage
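
For readers who want a feel for the "vector of ops" shape, here is a
userland Python sketch of a batch runner in which each op carries its
own test for what counts as a failure, and processing stops at the first
failing op. Op, run_batch and is_failure are hypothetical names, not a
proposed kernel interface. Being an ordinary loop in the calling
process, it can still be cut short by SIGKILL, so it shows the interface
shape only and gives none of the atomicity discussed in this thread.

```python
import os
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Op:
    """One entry in the batch (hypothetical structure)."""
    call: Callable[..., object]  # e.g. os.mkdir, os.rename
    args: Tuple = ()
    # Per-op failure test; by default any OSError counts as a failure.
    is_failure: Callable[[object], bool] = lambda result: isinstance(result, OSError)


def run_batch(ops):
    """Run ops in order and stop at the first one that counts as a failure.

    Returns (number of ops that completed without failing, results so far).
    """
    results = []
    for op in ops:
        try:
            result = op.call(*op.args)
        except OSError as exc:
            result = exc  # hand the error to the op's own failure test
        results.append(result)
        if op.is_failure(result):
            return len(results) - 1, results
    return len(ops), results
```

Passing is_failure=lambda r: False for an op makes the batch carry on
past that op's errors, for example a mkdir that is allowed to find the
directory already present.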

Andrey Kuzmin

2009-Nov-12 03:56 UTC

Re: [RFC] big fat transaction ioctl

On Wed, Nov 11, 2009 at 8:19 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 11 Nov 2009, Chris Mason wrote:
>
>> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>> >
>> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>> > > > Hi all,
>> > > >
>> > > > This is an alternative approach to atomic user transactions for
>> > > > btrfs.  The old start/end ioctls suffer from some basic
>> > > > limitations, namely
>> > > >
>> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC
>> > > > part way through the transaction, and
>> > > >  - The process may die (seg fault, SIGKILL) part way through the
>> > > > transaction.  Currently when that happens the partial
>> > > > transaction will commit.
>> > > >
>> > > > This patch implements an ioctl that lets the application
>> > > > completely specify the entire transaction in a single syscall.
>> > > > If the process gets killed or seg faults part way through, the
>> > > > entire transaction will still complete.
>> > > >
>> > > > The goal is to atomically commit updates to multiple files,
>> > > > xattrs, directories.  But this is still a file system: we don't
>> > > > get rollback if things go wrong.  Instead, do what we can up
>> > > > front to make sure things will work out.  And if things do go
>> > > > wrong, optionally prevent a partial result from reaching the
>> > > > disk.
>> > >
>> > > Why not snapshot respective root (doesn't work if transaction
>> > > spans multiple file-systems, but this doesn't look like a
>> > > real-world limitation), run txn against that snapshot and rollback
>> > > on failure instead? Snapshots are writable, cheap, and this looks
>> > > like a real transaction abort mechanism.
>> >
>> > Good question.  :)
>> >
>> > I hadn't looked into this before, but I think the snapshots could be
>> > used to achieve both atomicity and rollback.  If userspace uses an rw
>> > mutex to quiesce writes, it can make sure all transactions complete
>> > before creating a snapshot (commit).  The problem with this currently
>> > is the create snapshot ioctl is relatively slow... it calls
>> > commit_transaction, which blocks until everything reaches disk.  I
>> > think to perform well this approach would need a hook to start a
>> > commit and then return as soon as it can guarantee that any
>> > subsequent operation's start_transaction can't join in that commit.
>> >
>> > This may be a better way to go about this, though.  Does that sound
>> > reasonable, Chris?
>>
>> Yes, we could do this, but I don't think it will perform very well
>> compared to your multi-operation ioctl.  It really does depend on how
>> often you need to do atomic ops (my guess is very).
>
> The thing is, I'm not sure using snaps is that different from what I'm
> doing now.  Currently the ioctl transactions don't hit disk until each
> full commit (flushoncommit, no fsync).  Unless the presence of a snapshot
> adds additional overhead (to the commit, or to cleaning up the slightly
> longer-living snapped roots), the difference would be that starting
> transactions would need to be blocked by the application instead of
> wait_current_trans in start_transaction, and (currently at least) they
> would wait longer (the extra writes between blocked = 0 and
> commit_done = 1 in commit_transaction).
>
> The key, as now, is keeping the full fs syncs infrequent.  And, if
> possible, reducing the duration of the blocked == 1 period during
> commit_transaction.

It took me some time to associate you with the Ceph project and to recall
what Ceph is, so my original snapshot suggestion was out of context.
Put into the Ceph context, it looks too heavyweight and may turn out to
be overkill. Chris's write-ahead logging idea looks much more realistic
for your use case.
>
>
>> Honestly you'll get better performance with a simple write-ahead log
>> from userland:
>
> There actually is a log, but it's optional and not strictly write-ahead...
> it's only used to reduce the commit latency:
>
> 1- apply operations to fs (grouped into atomic transactions)
> 2- (optionally) write and flush log entry
> ...repeat...
> 3- periodically sync the fs, then trim the log.  or sync early if a
> client explicitly requests it.
>
> But
>
> 1- I don't want to make the log required.  Sometimes you're more
> concerned about total throughput, not latency, and the log halves your
> write bw unless you add more spindles.

The log-induced latency penalty is the price of transactional consistency
:). The traditional mitigation is a low-latency log device (NVRAM and,
more recently, SLC flash). Since you specifically target distributed
systems, you also have the option of distributed in-memory logging.

Regards,
Andrey
>
> 2- I don't want it strictly write-ahead because (in the absence of atomic
> ops) it means you have to wait for the log to sync before applying the ops
> to the fs (to ensure the fs doesn't get a partial transaction ahead of
> the log).  This marries atomicity with your schedule for durability,
> which isn't necessarily what you want.  (e.g., Ceph makes a distinction
> between serialized and committed ops, allowing limited sharing of data
> before it hits disk.  That's the nice thing about this ioctl... it's
> pretty common that atomicity is the only requirement.)
>
> With the optional (write-behind?) log and transaction ioctls, IF you want
> low latency commits, enable the log and ideally give it its own spindle,
> and infrequently sync btrfs to get good layout and low overhead.
>
>
> Unless you think I'm missing something with the snapshot approach, I can
> give that a try and see how it does.  It requires explicit management of
> the sync/commit schedule, but in my case at least I'm doing that already.
> A transaction ioctl is simpler for userland and would be more generically
> useful for other apps (particularly those who don't want to manage
> commits), but will always have some small possibility of partial
> failure/abort without rollback.
>
> sage
>
>
>>
>> step1: write redo log somewhere in the FS, with enough information to
>> bring all the objects you're about to touch to a consistent state.
>> step2: fsync the log
>> step3: do your operations
>> step4: append a record to the undo log that invalidates the last log
>> op, or just truncate it to zero.
>> step5: fsync the log.
>>
>> The big advantage of the log is that you won't be tied to btrfs, but
>> it's two fsyncs where the big transaction framework does none.  This
>> should allow you to turn on the fast fsync log again, but I think the
>> multi-operation ioctl would do that as well.
>>
>> -chris
>>
