thr3ads.net - Btrfs devel - [PATCH 0/4] btrfs: offline dedupe v2 [Jun 2013]

If this information is useful, please help other people find it:
Share via:

Mark Fasheh

2013-Jun-11 20:31 UTC

[PATCH 0/4] btrfs: offline dedupe v2

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.

To be clear, "offline" in this sense means that the file system is
mounted and running, but the dedupe is not done during file writes,
but after the fact when some userspace software initiates a dedupe.

The primary patch is loosely based off of one sent by Josef Bacik back
in January, 2011.

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508

I''ve made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.


The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
	__u64 logical_offset;	/* in - start of extent in source */
	__u64 length;		/* in - length of extent */
	__u16 dest_count;	/* in - total elements in info array */
	__u16 reserved1;
	__u32 reserved2;
	struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has it''s own status and
''bytes_deduped''
member. This provides a number of benefits:

- We don''t have to fail the entire ioctl because one of the dedupes
failed.

- Userspace will always know how much progress was made on a file as we always
  return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS	1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
	__s64 fd;		/* in - destination file */
	__u64 logical_offset;	/* in - start of extent in destination */
	__u64 bytes_deduped;	/* out - total # of bytes we were able
				 * to dedupe from this file */
	/* status of this dedupe operation:
	 * 0 if dedup succeeds
	 * < 0 for error
	 * == BTRFS_SAME_DATA_DIFFERS if data differs
	 */
	__s32 status;		/* out - see above description */
	__u32 reserved;
};


The kernel patches are based off Linux v3.9. At this point I''ve tested
the
ioctl against a decent variety of files and conditions.

A git tree for the kernel changes can be found at:

https://github.com/markfasheh/btrfs-extent-same


I have a userspace project, duperemove available at:

https://github.com/markfasheh/duperemove

Hopefully this can serve as an example of one possible usage of the ioctl.

duperemove takes a list of files as argument and will search them for
duplicated extents. If given the ''-D'' switch, duperemove will
send dedupe
requests for same extents and display the results.

Within the duperemove repo is a file, btrfs-extent-same.c that acts as
a test wrapper around the ioctl. It can be compiled completely
seperately from the rest of the project via "make
btrfs-extent-same". This makes direct testing of the ioctl more
convenient.


Limitations

We can''t yet dedupe within the same file (that is, source and
destination
are the same inode). This is due to a limitation in btrfs_clone().


Perhaps this isn''t a limiation per-se but extent-same requires
read/write
access to the files we want to dedupe.  During my last series I had a
conversation with Gabriel de Perthuis about access checking where we tried
to maintain the ability for a user to run extent-same against a readonly
snapshot. In addition, I reasoned that since the underlying data won''t
change (at least to the user) that we ought only require the files to be
open for read.

What I found however is that neither of these is a great idea ;)

- We want to require that the inode be open for writing so that an
  unprivileged user can''t do things like run dedupe on a performance
  sensitive file that they might only have read access to.  In addition I
  could see it as kind of a surprise (non-standard behavior) to an
  administrator that users could alter the layout of files they are only
  allowed to read.

- Readonly snapshots won''t let you open for write anyway
(unsuprisingly,
  open() returns -EROFS).  So that kind of kills the idea of them being able
  to open those files for write which we want to dedupe.

That said, I still think being able to run this against a set of readonly
snapshots makes sense especially if those snapshots are taken for backup
purposes. I''m just not sure how we can sanely enable it.



Code review is very much appreciated. Thanks,
     --Mark


ChangeLog

- check that we have appropriate access to each file before deduping. For
  the source, we only check that it is opened for read. Target files have to
  be open for write.

- don''t dedupe on readonly submounts (this is to maintain 

- check that we don''t dedupe files with different checksumming states
 (compare BTRFS_INODE_NODATASUM flags)

- get and maintain write access to the mount during the extent same
  operation (mount_want_write())

- allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
  pass them through for re-use on every call to btrfs_extent_same(). (thanks
  to David Sterba <dsterba@suse.cz> for reporting this

- As the read buffers could possibly be up to 1MB (depending on user
  request), we now conditionally vmalloc them.

- removed redundant check for same inode. btrfs_extent_same() catches it now
  and bubbles the error up.

- remove some unnecessary printks

Changes from RFC to v1:

- don''t error on large length value in btrfs exent-same, instead we
just
  dedupe the maximum allowed.  That way userspace doesn''t have to worry
  about an arbitrary length limit.

- btrfs_extent_same will now loop over the dedupe range at 1MB increments (for
  a total of 16MB per request)

- cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
  David Sterba <dsterba@suse.cz> for reporting this)

- included two fixes from Gabriel de Perthuis <g2p.code@gmail.com>:
   - allow dedupe across subvolumes
   - don''t lock compressed pages twice when deduplicating

- removed some unused / poorly designed fields in btrfs_ioctl_same_args.
  This should also give us a bit more reserved bytes.

- return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
  David Sterba <dsterba@suse.cz> for reporting this)

- Some more reserved bytes are now included as a result of some of my
  cleanups. Quite possibly we could add a couple more.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 20:31 UTC

head link

[PATCH 1/4] btrfs: abtract out range locking in clone ioctl()

The range locking in btrfs_ioctl_clone is trivially broken out into
it''s own
function. This reduces the complexity of btrfs_ioctl_clone() by a small bit
and makes that locking code available to future functions in
fs/btrfs/ioctl.c

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c02310..ef53952 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2456,6 +2456,26 @@ out:
 	return ret;
 }
 
+static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
+{
+	/* do any pending delalloc/csum calc on src, one way or
+	   another, and lock file content */
+	while (1) {
+		struct btrfs_ordered_extent *ordered;
+		lock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1);
+		ordered = btrfs_lookup_first_ordered_extent(inode,
+							    off + len - 1);
+		if (!ordered &&
+		    !test_range_bit(&BTRFS_I(inode)->io_tree, off,
+				    off + len - 1, EXTENT_DELALLOC, 0, NULL))
+			break;
+		unlock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1);
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		btrfs_wait_ordered_range(inode, off, len);
+	}
+}
+
 static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 				       u64 off, u64 olen, u64 destoff)
 {
@@ -2573,21 +2593,7 @@ static noinline long btrfs_ioctl_clone(struct file *file,
unsigned long srcfd,
 	truncate_inode_pages_range(&inode->i_data, destoff,
 				   PAGE_CACHE_ALIGN(destoff + len) - 1);
 
-	/* do any pending delalloc/csum calc on src, one way or
-	   another, and lock file content */
-	while (1) {
-		struct btrfs_ordered_extent *ordered;
-		lock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
-		ordered = btrfs_lookup_first_ordered_extent(src, off + len - 1);
-		if (!ordered &&
-		    !test_range_bit(&BTRFS_I(src)->io_tree, off, off + len - 1,
-				    EXTENT_DELALLOC, 0, NULL))
-			break;
-		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
-		if (ordered)
-			btrfs_put_ordered_extent(ordered);
-		btrfs_wait_ordered_range(src, off, len);
-	}
+	lock_extent_range(src, off, len);
 
 	/* clone data */
 	key.objectid = btrfs_ino(src);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 20:31 UTC

head link

[PATCH 2/4] btrfs_ioctl_clone: Move clone code into it''s own function

There''s some 250+ lines here that are easily encapsulated into their
own
function. I don''t change how anything works here, just create and
document
the new btrfs_clone() function from btrfs_ioctl_clone() code.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c | 232 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 128 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ef53952..e90c519 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2476,125 +2476,43 @@ static inline void lock_extent_range(struct inode
*inode, u64 off, u64 len)
 	}
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+/**
+ * btrfs_clone() - clone a range from inode file to another
+ *
+ * @src: Inode to clone from
+ * @inode: Inode to clone to
+ * @off: Offset within source to start clone from
+ * @olen: Original length, passed by user, of range to clone
+ * @olen_aligned: Block-aligned value of olen, extent_same uses
+ *               identical values here
+ * @destoff: Offset within @inode to start clone
+ */
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff)
 {
-	struct inode *inode = file_inode(file);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
-	struct btrfs_trans_handle *trans;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	struct extent_buffer *leaf;
-	char *buf;
+	struct btrfs_trans_handle *trans;
+	char *buf = NULL;
 	struct btrfs_key key;
 	u32 nritems;
 	int slot;
 	int ret;
-	u64 len = olen;
-	u64 bs = root->fs_info->sb->s_blocksize;
-
-	/*
-	 * TODO:
-	 * - split compressed inline extents.  annoying: we need to
-	 *   decompress into destination''s address_space (the file offset
-	 *   may change, so source mapping won''t do), then recompress (or
-	 *   otherwise reinsert) a subrange.
-	 * - allow ranges within the same file to be cloned (provided
-	 *   they don''t overlap)?
-	 */
-
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
-	if (btrfs_root_readonly(root))
-		return -EROFS;
-
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		goto out_fput;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
-	/* don''t make the dst file partly checksummed */
-	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !-	   
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
-
-	ret = -EISDIR;
-	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+	u64 len = olen_aligned;
 
 	ret = -ENOMEM;
 	buf = vmalloc(btrfs_level_size(root, 0));
 	if (!buf)
-		goto out_fput;
+		return ret;
 
 	path = btrfs_alloc_path();
 	if (!path) {
 		vfree(buf);
-		goto out_fput;
-	}
-	path->reada = 2;
-
-	if (inode < src) {
-		mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&src->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&src->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
-	}
-
-	/* determine range to clone */
-	ret = -EINVAL;
-	if (off + len > src->i_size || off + len < off)
-		goto out_unlock;
-	if (len == 0)
-		olen = len = src->i_size - off;
-	/* if we extend to eof, continue to block boundary */
-	if (off + len == src->i_size)
-		len = ALIGN(src->i_size, bs) - off;
-
-	/* verify the end result is block aligned */
-	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
-	    !IS_ALIGNED(destoff, bs))
-		goto out_unlock;
-
-	if (destoff > inode->i_size) {
-		ret = btrfs_cont_expand(inode, inode->i_size, destoff);
-		if (ret)
-			goto out_unlock;
+		return ret;
 	}
 
-	/* truncate page cache pages from target inode range */
-	truncate_inode_pages_range(&inode->i_data, destoff,
-				   PAGE_CACHE_ALIGN(destoff + len) - 1);
-
-	lock_extent_range(src, off, len);
-
+	path->reada = 2;
 	/* clone data */
 	key.objectid = btrfs_ino(src);
 	key.type = BTRFS_EXTENT_DATA_KEY;
@@ -2840,14 +2758,120 @@ next:
 		key.offset++;
 	}
 	ret = 0;
+
 out:
 	btrfs_release_path(path);
+	btrfs_free_path(path);
+	vfree(buf);
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct inode *inode = fdentry(file)->d_inode;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct fd src_file;
+	struct inode *src;
+	int ret;
+	u64 len = olen;
+	u64 bs = root->fs_info->sb->s_blocksize;
+
+	/*
+	 * TODO:
+	 * - split compressed inline extents.  annoying: we need to
+	 *   decompress into destination''s address_space (the file offset
+	 *   may change, so source mapping won''t do), then recompress (or
+	 *   otherwise reinsert) a subrange.
+	 * - allow ranges within the same file to be cloned (provided
+	 *   they don''t overlap)?
+	 */
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	if (btrfs_root_readonly(root))
+		return -EROFS;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	ret = -EXDEV;
+	if (src_file.file->f_path.mnt != file->f_path.mnt)
+		goto out_fput;
+
+	src = src_file.file->f_dentry->d_inode;
+
+	ret = -EINVAL;
+	if (src == inode)
+		goto out_fput;
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ))
+		goto out_fput;
+
+	/* don''t make the dst file partly checksummed */
+	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !+	   
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto out_fput;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EXDEV;
+	if (src->i_sb != inode->i_sb)
+		goto out_fput;
+
+	if (inode < src) {
+		mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&src->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&src->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
+	}
+
+	/* determine range to clone */
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out_unlock;
+	if (len == 0)
+		olen = len = src->i_size - off;
+	/* if we extend to eof, continue to block boundary */
+	if (off + len == src->i_size)
+		len = ALIGN(src->i_size, bs) - off;
+
+	/* verify the end result is block aligned */
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
+	    !IS_ALIGNED(destoff, bs))
+		goto out_unlock;
+
+	if (destoff > inode->i_size) {
+		ret = btrfs_cont_expand(inode, inode->i_size, destoff);
+		if (ret)
+			goto out_unlock;
+	}
+
+	/* truncate page cache pages from target inode range */
+	truncate_inode_pages_range(&inode->i_data, destoff,
+				   PAGE_CACHE_ALIGN(destoff + len) - 1);
+
+	lock_extent_range(src, off, len);
+
+	ret = btrfs_clone(src, inode, off, olen, len, destoff);
+
 	unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
 out_unlock:
 	mutex_unlock(&src->i_mutex);
 	mutex_unlock(&inode->i_mutex);
-	vfree(buf);
-	btrfs_free_path(path);
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 20:31 UTC

head link

[PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.

Partial credit for this patch goes to Gabriel de Perthuis
<g2p.code@gmail.com>
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/compression.c |  6 +++++-
 fs/btrfs/extent_io.c   | 41 +++++++++++++++++++++++++++++++----------
 fs/btrfs/extent_io.h   |  3 +++
 3 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15b9408..05819c3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -636,7 +636,11 @@ int btrfs_submit_compressed_read(struct inode *inode,
struct bio *bio,
 	faili = nr_pages - 1;
 	cb->nr_pages = nr_pages;
 
-	add_ra_bio_pages(inode, em_start + em_len, cb);
+	/* In the parent-locked case, we only locked the range we are
+	 * interested in.  In all other cases, we can opportunistically
+	 * cache decompressed data that goes beyond the requested range. */
+	if (!(bio_flags & EXTENT_BIO_PARENT_LOCKED))
+		add_ra_bio_pages(inode, em_start + em_len, cb);
 
 	/* include any pages we added in add_ra-bio_pages */
 	uncompressed_len = bio->bi_vcnt * PAGE_CACHE_SIZE;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..80ce106 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2643,11 +2643,12 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 	struct btrfs_ordered_extent *ordered;
 	int ret;
 	int nr = 0;
+	int parent_locked = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
 	size_t pg_offset = 0;
 	size_t iosize;
 	size_t disk_io_size;
 	size_t blocksize = inode->i_sb->s_blocksize;
-	unsigned long this_bio_flag = 0;
+	unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
 
 	set_page_extent_mapped(page);
 
@@ -2659,7 +2660,7 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 	}
 
 	end = page_end;
-	while (1) {
+	while (!parent_locked) {
 		lock_extent(tree, start, end);
 		ordered = btrfs_lookup_ordered_extent(inode, start);
 		if (!ordered)
@@ -2695,15 +2696,18 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 			kunmap_atomic(userpage);
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
-			unlock_extent_cached(tree, cur, cur + iosize - 1,
-					     &cached, GFP_NOFS);
+			if (!parent_locked)
+				unlock_extent_cached(tree, cur,
+						     cur + iosize - 1,
+						     &cached, GFP_NOFS);
 			break;
 		}
 		em = get_extent(inode, page, pg_offset, cur,
 				end - cur + 1, 0);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
-			unlock_extent(tree, cur, end);
+			if (!parent_locked)
+				unlock_extent(tree, cur, end);
 			break;
 		}
 		extent_offset = cur - em->start;
@@ -2711,7 +2715,7 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 		BUG_ON(end < cur);
 
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
-			this_bio_flag = EXTENT_BIO_COMPRESSED;
+			this_bio_flag |= EXTENT_BIO_COMPRESSED;
 			extent_set_compress_type(&this_bio_flag,
 						 em->compress_type);
 		}
@@ -2755,7 +2759,8 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
 			check_page_uptodate(tree, page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -2765,7 +2770,8 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
 			SetPageError(page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -2783,7 +2789,8 @@ static int __extent_read_full_page(struct extent_io_tree
*tree,
 			*bio_flags = this_bio_flag;
 		} else {
 			SetPageError(page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 		}
 		cur = cur + iosize;
 		pg_offset += iosize;
@@ -2811,6 +2818,20 @@ int extent_read_full_page(struct extent_io_tree *tree,
struct page *page,
 	return ret;
 }
 
+int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page
*page,
+				 get_extent_t *get_extent, int mirror_num)
+{
+	struct bio *bio = NULL;
+	unsigned long bio_flags = EXTENT_BIO_PARENT_LOCKED;
+	int ret;
+
+	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
+				      &bio_flags);
+	if (bio)
+		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
+	return ret;
+}
+
 static noinline void update_nr_written(struct page *page,
 				      struct writeback_control *wbc,
 				      unsigned long nr_written)
@@ -3666,7 +3687,7 @@ int extent_readpages(struct extent_io_tree *tree,
 			continue;
 		for (i = 0; i < nr; i++) {
 			__extent_read_full_page(tree, pagepool[i], get_extent,
-					&bio, 0, &bio_flags);
+						&bio, 0, &bio_flags);
 			page_cache_release(pagepool[i]);
 		}
 		nr = 0;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 258c921..e3654bd 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -28,6 +28,7 @@
  */
 #define EXTENT_BIO_COMPRESSED 1
 #define EXTENT_BIO_TREE_LOG 2
+#define EXTENT_BIO_PARENT_LOCKED 4
 #define EXTENT_BIO_FLAG_SHIFT 16
 
 /* these are bit numbers for test/set bit */
@@ -198,6 +199,8 @@ int unlock_extent_cached(struct extent_io_tree *tree, u64
start, u64 end,
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
 			  get_extent_t *get_extent, int mirror_num);
+int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page
*page,
+				 get_extent_t *get_extent, int mirror_num);
 int __init extent_io_init(void);
 void extent_io_exit(void);
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 20:31 UTC

head link

[PATCH 4/4] btrfs: offline dedupe

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c           | 329 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  27 ++++
 2 files changed, 356 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e90c519..9351b69 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2456,6 +2459,54 @@ out:
 	return ret;
 }
 
+static noinline int fill_data(struct inode *inode, u64 off, u64 len,
+			      char *buffer)
+{
+	struct page *page;
+	void *addr;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!PageUptodate(page)) {
+			extent_read_full_page_nolock(tree, page,
+						     btrfs_get_extent, 0);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+out:
+	if (ret)
+		kfree(buffer);
+	return ret;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2476,6 +2527,282 @@ static inline void lock_extent_range(struct inode
*inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode1, loff1, len);
+		lock_extent_range(inode2, loff2, len);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+		lock_extent_range(inode1, loff1, len);
+	}
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff,
+			     char *src_buffer, char *dst_buffer)
+{
+	int ret;
+
+	/*
+	 * btrfs_clone() can''t handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = fill_data(src, loff, len, src_buffer);
+	if (ret)
+		goto out;
+
+	ret = fill_data(dst, dst_loff, len, dst_buffer);
+	if (ret)
+		goto out;
+
+	ret = memcmp(src_buffer, dst_buffer, len);
+	if (ret) {
+		ret = BTRFS_SAME_DATA_DIFFERS;
+		goto out;
+	}
+
+	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+out:
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	return ret;
+}
+
+static void *btrfs_kvmalloc(size_t size, gfp_t flags)
+{
+	void *ret;
+
+	ret = kmalloc(size, flags);
+	if (!ret)
+		ret = __vmalloc(size, flags, PAGE_KERNEL);
+	return ret;
+}
+
+static void btrfs_kvfree(void *ptr)
+{
+	if (is_vmalloc_addr(ptr))
+		vfree(ptr);
+	else
+		kfree(ptr);
+}
+
+#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct btrfs_ioctl_same_extent_info *info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int args_size;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	char *src_buffer = NULL;
+	char *dst_buffer = NULL;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (btrfs_root_readonly(BTRFS_I(src)->root))
+		return -EROFS;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp))) {
+		ret = -EFAULT;
+		goto out_drop_write;
+	}
+
+	args_size = sizeof(tmp) + (tmp.dest_count *
+			sizeof(struct btrfs_ioctl_same_extent_info));
+
+	/* Keep size of ioctl argument sane */
+	if (args_size > PAGE_CACHE_SIZE) {
+		ret = -E2BIG;
+		goto out_drop_write;
+	}
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args) {
+		ret = -ENOMEM;
+		goto out_drop_write;
+	}
+
+	ret = -EFAULT;
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		goto out;
+	/* Make sure args didn''t change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp)))
+		goto out;
+
+	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
+		goto out;
+
+	/* pre-format ''out'' fields to sane default values */
+	for (i = 0; i < args->dest_count; i++) {
+		info = &args->info[i];
+		info->bytes_deduped = 0;
+		info->status = 0;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation. 
+	 * This is intended to bound the entire ioctl to something sane.
+	 */
+	if (len > BTRFS_MAX_DEDUPE_LEN)
+		len = BTRFS_MAX_DEDUPE_LEN;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out;
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		goto out;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	src_buffer = btrfs_kvmalloc(len, GFP_NOFS);
+	dst_buffer = btrfs_kvmalloc(len, GFP_NOFS);
+	if (!src_buffer || !dst_buffer) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = 0;
+	for (i = 0; i < args->dest_count; i++) {
+		u64 dest_off;
+		u64 src_off;
+		u64 op_len;
+
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			info->status = -EBADF;
+			continue;
+		}
+
+		if (!(dst_file->f_mode & FMODE_WRITE)) {
+			info->status = -EINVAL;
+			goto next;
+		}
+
+		info->status = -EXDEV;
+		if (file->f_path.mnt != dst_file->f_path.mnt)
+			goto next;
+
+		dst = dst_file->f_dentry->d_inode;
+		if (src->i_sb != dst->i_sb)
+			goto next;
+
+		if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+			goto next;
+		}
+
+		info->status = -EINVAL;
+		/* don''t make the dst file partly checksummed */
+		if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !+		   
(BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
+			goto next;
+
+		dest_off = info->logical_offset;
+		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
+			goto next;
+		if (!IS_ALIGNED(dest_off, bs))
+			goto next;
+
+		/*
+		 * The purpose of this loop is to limit the number of
+		 * bytes we dedupe during a single call to
+		 * btrfs_extent_same().
+		 *
+		 * In order to memcmp the data we have to allocate a
+		 * pair of buffers. We don''t want to allocate too
+		 * large a buffer, so limiting the size for each
+		 * dedupe is an easy way to do this.
+		 */
+		src_off = off;
+		op_len = len;
+		while (op_len) {
+			u64 tmp_len;
+
+			tmp_len = op_len;
+			if (op_len > BTRFS_ONE_DEDUPE_LEN)
+				tmp_len = BTRFS_ONE_DEDUPE_LEN;
+
+			info->status = btrfs_extent_same(src, src_off, tmp_len,
+							 dst, dest_off,
+							 src_buffer, dst_buffer);
+			if (info->status == 0) {
+				info->bytes_deduped += tmp_len;
+			} else
+				break;
+
+			dest_off += tmp_len;
+			src_off += tmp_len;
+			op_len -= tmp_len;
+		}
+
+next:
+		fput(dst_file);
+		dst_file = NULL;
+	}
+
+	if (copy_to_user(argp, args, args_size))
+		ret = -EFAULT;
+
+out:
+	btrfs_kvfree(src_buffer);
+	btrfs_kvfree(dst_buffer);
+	if (dst_file)
+		fput(dst_file);
+	kfree(args);
+out_drop_write:
+	mnt_drop_write_file(file);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4151,6 +4478,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index fa3a5f9..5465bc2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -305,6 +305,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -510,5 +535,7 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gabriel de Perthuis

2013-Jun-11 20:56 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

Le 11/06/2013 22:31, Mark Fasheh a écrit :> Perhaps this isn''t a limiation per-se but extent-same requires
read/write
> access to the files we want to dedupe.  During my last series I had a
> conversation with Gabriel de Perthuis about access checking where we tried
> to maintain the ability for a user to run extent-same against a readonly
> snapshot. In addition, I reasoned that since the underlying data
won''t
> change (at least to the user) that we ought only require the files to be
> open for read.
> 
> What I found however is that neither of these is a great idea ;)
> 
> - We want to require that the inode be open for writing so that an
>   unprivileged user can''t do things like run dedupe on a
performance
>   sensitive file that they might only have read access to.  In addition I
>   could see it as kind of a surprise (non-standard behavior) to an
>   administrator that users could alter the layout of files they are only
>   allowed to read.
> 
> - Readonly snapshots won''t let you open for write anyway
(unsuprisingly,
>   open() returns -EROFS).  So that kind of kills the idea of them being
able
>   to open those files for write which we want to dedupe.
> 
> That said, I still think being able to run this against a set of readonly
> snapshots makes sense especially if those snapshots are taken for backup
> purposes. I''m just not sure how we can sanely enable it.
The check could be: if (fmode_write || cap_sys_admin).

This isn''t incompatible with mnt_want_write, that check is at the
level of the superblocks and vfsmount and not the subvolume fsid.
> Code review is very much appreciated. Thanks,
>      --Mark
> 
> 
> ChangeLog
> 
> - check that we have appropriate access to each file before deduping. For
>   the source, we only check that it is opened for read. Target files have
to
>   be open for write.
> 
> - don''t dedupe on readonly submounts (this is to maintain 
> 
> - check that we don''t dedupe files with different checksumming
states
>  (compare BTRFS_INODE_NODATASUM flags)
> 
> - get and maintain write access to the mount during the extent same
>   operation (mount_want_write())
> 
> - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
>   pass them through for re-use on every call to btrfs_extent_same().
(thanks
>   to David Sterba <dsterba@suse.cz> for reporting this
> 
> - As the read buffers could possibly be up to 1MB (depending on user
>   request), we now conditionally vmalloc them.
> 
> - removed redundant check for same inode. btrfs_extent_same() catches it
now
>   and bubbles the error up.
> 
> - remove some unnecessary printks
> 
> Changes from RFC to v1:
> 
> - don''t error on large length value in btrfs exent-same, instead
we just
>   dedupe the maximum allowed.  That way userspace doesn''t have to
worry
>   about an arbitrary length limit.
> 
> - btrfs_extent_same will now loop over the dedupe range at 1MB increments
(for
>   a total of 16MB per request)
> 
> - cleaned up poorly coded while loop in __extent_read_full_page() (thanks
to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - included two fixes from Gabriel de Perthuis <g2p.code@gmail.com>:
>    - allow dedupe across subvolumes
>    - don''t lock compressed pages twice when deduplicating
> 
> - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
>   This should also give us a bit more reserved bytes.
> 
> - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - Some more reserved bytes are now included as a result of some of my
>   cleanups. Quite possibly we could add a couple more.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 21:04 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis
wrote:> > What I found however is that neither of these is a great idea ;)
> > 
> > - We want to require that the inode be open for writing so that an
> >   unprivileged user can''t do things like run dedupe on a
performance
> >   sensitive file that they might only have read access to.  In
addition I
> >   could see it as kind of a surprise (non-standard behavior) to an
> >   administrator that users could alter the layout of files they are
only
> >   allowed to read.
> > 
> > - Readonly snapshots won''t let you open for write anyway
(unsuprisingly,
> >   open() returns -EROFS).  So that kind of kills the idea of them
being able
> >   to open those files for write which we want to dedupe.
> > 
> > That said, I still think being able to run this against a set of
readonly
> > snapshots makes sense especially if those snapshots are taken for
backup
> > purposes. I''m just not sure how we can sanely enable it.
> 
> The check could be: if (fmode_write || cap_sys_admin).
> 
> This isn''t incompatible with mnt_want_write, that check is at the
> level of the superblocks and vfsmount and not the subvolume fsid.
Oh ok that''s certainly better. I think we still have a problem though -
how
does a process gets write access to a file from a ro-snapshot? If I open a
file (as root) on a ro-snapshot on my test machine here I''ll get
-EROFS.

I''m a bit confused - how does mnt_want_write factor in here? I think
that''s
for a totally seperate kind of accounting, right?

Thanks for the quick reply :)
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gabriel de Perthuis

2013-Jun-11 21:31 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

Le 11/06/2013 23:04, Mark Fasheh a écrit :> On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis wrote:
>>> What I found however is that neither of these is a great idea ;)
>>>
>>> - We want to require that the inode be open for writing so that an
>>>   unprivileged user can''t do things like run dedupe on a
performance
>>>   sensitive file that they might only have read access to.  In
addition I
>>>   could see it as kind of a surprise (non-standard behavior) to an
>>>   administrator that users could alter the layout of files they are
only
>>>   allowed to read.
>>>
>>> - Readonly snapshots won''t let you open for write anyway
(unsuprisingly,
>>>   open() returns -EROFS).  So that kind of kills the idea of them
being able
>>>   to open those files for write which we want to dedupe.
>>>
>>> That said, I still think being able to run this against a set of
readonly
>>> snapshots makes sense especially if those snapshots are taken for
backup
>>> purposes. I''m just not sure how we can sanely enable it.
>>
>> The check could be: if (fmode_write || cap_sys_admin).
>>
>> This isn''t incompatible with mnt_want_write, that check is at
the
>> level of the superblocks and vfsmount and not the subvolume fsid.
> 
> Oh ok that''s certainly better. I think we still have a problem
though - how
> does a process gets write access to a file from a ro-snapshot? If I open a
> file (as root) on a ro-snapshot on my test machine here I''ll get
-EROFS.
Your first series did work in that case.
The process does get a read-only fd, but that''s no obstacle for the
ioctl.
> I''m a bit confused - how does mnt_want_write factor in here? I
think that''s
> for a totally seperate kind of accounting, right?
It doesn''t, it''s just that I had spent a few minutes checking
anyway.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-11 21:45 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

On Tue, Jun 11, 2013 at 11:31:41PM +0200, Gabriel de Perthuis
wrote:> Le 11/06/2013 23:04, Mark Fasheh a écrit :
> > On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis wrote:
> >>> What I found however is that neither of these is a great idea
;)
> >>>
> >>> - We want to require that the inode be open for writing so
that an
> >>>   unprivileged user can''t do things like run dedupe
on a performance
> >>>   sensitive file that they might only have read access to.  In
addition I
> >>>   could see it as kind of a surprise (non-standard behavior)
to an
> >>>   administrator that users could alter the layout of files
they are only
> >>>   allowed to read.
> >>>
> >>> - Readonly snapshots won''t let you open for write
anyway (unsuprisingly,
> >>>   open() returns -EROFS).  So that kind of kills the idea of
them being able
> >>>   to open those files for write which we want to dedupe.
> >>>
> >>> That said, I still think being able to run this against a set
of readonly
> >>> snapshots makes sense especially if those snapshots are taken
for backup
> >>> purposes. I''m just not sure how we can sanely enable
it.
> >>
> >> The check could be: if (fmode_write || cap_sys_admin).
> >>
> >> This isn''t incompatible with mnt_want_write, that check
is at the
> >> level of the superblocks and vfsmount and not the subvolume fsid.
> > 
> > Oh ok that''s certainly better. I think we still have a
problem though - how
> > does a process gets write access to a file from a ro-snapshot? If I
open a
> > file (as root) on a ro-snapshot on my test machine here I''ll
get -EROFS.
> 
> Your first series did work in that case.
> The process does get a read-only fd, but that''s no obstacle for
the ioctl.
Ahh, I had ignored the ''||'' in your check above. Basically
though you have
to have write access unless you''re the sysadmin, in which case open for
reading is enough. Makes sense -- I''ll try this.
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Jun-12 18:10 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

On Tue, Jun 11, 2013 at 02:31:34PM -0600, Mark Fasheh
wrote:> Hi,
> 
> The following series of patches implements in btrfs an ioctl to do
> offline deduplication of file extents.
> 
> To be clear, "offline" in this sense means that the file system
is
> mounted and running, but the dedupe is not done during file writes,
> but after the fact when some userspace software initiates a dedupe.
> 
> The primary patch is loosely based off of one sent by Josef Bacik back
> in January, 2011.
> 
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508
> 
> I''ve made significant updates and changes from the original. In
> particular the structure passed is more fleshed out, this series has a
> high degree of code sharing between itself and the clone code, and the
> locking has been updated.
> 
> 
> The ioctl accepts a struct:
> 
> struct btrfs_ioctl_same_args {
> 	__u64 logical_offset;	/* in - start of extent in source */
> 	__u64 length;		/* in - length of extent */
> 	__u16 dest_count;	/* in - total elements in info array */
> 	__u16 reserved1;
> 	__u32 reserved2;
> 	struct btrfs_ioctl_same_extent_info info[0];
> };
> 
> Userspace puts each duplicate extent (other than the source) in an
> item in the info array. As there can be multiple dedupes in one
> operation, each info item has it''s own status and
''bytes_deduped''
> member. This provides a number of benefits:
> 
> - We don''t have to fail the entire ioctl because one of the
dedupes failed.
> 
> - Userspace will always know how much progress was made on a file as we
always
>   return the number of bytes deduped.
> 
> 
> #define BTRFS_SAME_DATA_DIFFERS	1
> /* For extent-same ioctl */
> struct btrfs_ioctl_same_extent_info {
> 	__s64 fd;		/* in - destination file */
> 	__u64 logical_offset;	/* in - start of extent in destination */
> 	__u64 bytes_deduped;	/* out - total # of bytes we were able
> 				 * to dedupe from this file */
> 	/* status of this dedupe operation:
> 	 * 0 if dedup succeeds
> 	 * < 0 for error
> 	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> 	 */
> 	__s32 status;		/* out - see above description */
> 	__u32 reserved;
> };
> 
> 
> The kernel patches are based off Linux v3.9. At this point I''ve
tested the
> ioctl against a decent variety of files and conditions.
> 
> A git tree for the kernel changes can be found at:
> 
> https://github.com/markfasheh/btrfs-extent-same
> 
> 
> I have a userspace project, duperemove available at:
> 
> https://github.com/markfasheh/duperemove
> 
> Hopefully this can serve as an example of one possible usage of the ioctl.
> 
> duperemove takes a list of files as argument and will search them for
> duplicated extents. If given the ''-D'' switch, duperemove
will send dedupe
> requests for same extents and display the results.
> 
> Within the duperemove repo is a file, btrfs-extent-same.c that acts as
> a test wrapper around the ioctl. It can be compiled completely
> seperately from the rest of the project via "make
> btrfs-extent-same". This makes direct testing of the ioctl more
> convenient.
> 
> 
> Limitations
> 
> We can''t yet dedupe within the same file (that is, source and
destination
> are the same inode). This is due to a limitation in btrfs_clone().
> 
> 
> Perhaps this isn''t a limiation per-se but extent-same requires
read/write
> access to the files we want to dedupe.  During my last series I had a
> conversation with Gabriel de Perthuis about access checking where we tried
> to maintain the ability for a user to run extent-same against a readonly
> snapshot. In addition, I reasoned that since the underlying data
won''t
> change (at least to the user) that we ought only require the files to be
> open for read.
> 
> What I found however is that neither of these is a great idea ;)
> 
> - We want to require that the inode be open for writing so that an
>   unprivileged user can''t do things like run dedupe on a
performance
>   sensitive file that they might only have read access to.  In addition I
>   could see it as kind of a surprise (non-standard behavior) to an
>   administrator that users could alter the layout of files they are only
>   allowed to read.
> 
> - Readonly snapshots won''t let you open for write anyway
(unsuprisingly,
>   open() returns -EROFS).  So that kind of kills the idea of them being
able
>   to open those files for write which we want to dedupe.
> 
> That said, I still think being able to run this against a set of readonly
> snapshots makes sense especially if those snapshots are taken for backup
> purposes. I''m just not sure how we can sanely enable it.
> 
> 
> 
> Code review is very much appreciated. Thanks,
>      --Mark
> 
> 
> ChangeLog
> 
> - check that we have appropriate access to each file before deduping. For
>   the source, we only check that it is opened for read. Target files have
to
>   be open for write.
> 
> - don''t dedupe on readonly submounts (this is to maintain 
> 
> - check that we don''t dedupe files with different checksumming
states
>  (compare BTRFS_INODE_NODATASUM flags)
> 
> - get and maintain write access to the mount during the extent same
>   operation (mount_want_write())
> 
> - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
>   pass them through for re-use on every call to btrfs_extent_same().
(thanks
>   to David Sterba <dsterba@suse.cz> for reporting this
> 
> - As the read buffers could possibly be up to 1MB (depending on user
>   request), we now conditionally vmalloc them.
> 
> - removed redundant check for same inode. btrfs_extent_same() catches it
now
>   and bubbles the error up.
> 
> - remove some unnecessary printks
> 
> Changes from RFC to v1:
> 
> - don''t error on large length value in btrfs exent-same, instead
we just
>   dedupe the maximum allowed.  That way userspace doesn''t have to
worry
>   about an arbitrary length limit.
> 
> - btrfs_extent_same will now loop over the dedupe range at 1MB increments
(for
>   a total of 16MB per request)
> 
> - cleaned up poorly coded while loop in __extent_read_full_page() (thanks
to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - included two fixes from Gabriel de Perthuis <g2p.code@gmail.com>:
>    - allow dedupe across subvolumes
>    - don''t lock compressed pages twice when deduplicating
> 
> - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
>   This should also give us a bit more reserved bytes.
> 
> - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - Some more reserved bytes are now included as a result of some of my
>   cleanups. Quite possibly we could add a couple more.
Ok I''m relatively happy with this set, I just want to have an xfstest
on deck to
test it with.  Once I have an xfstest to run to verify its sanity I''ll
pull it
in.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mark Fasheh

2013-Jun-17 20:04 UTC

head link

Re: [PATCH 0/4] btrfs: offline dedupe v2

On Wed, Jun 12, 2013 at 02:10:37PM -0400, Josef Bacik
wrote:> On Tue, Jun 11, 2013 at 02:31:34PM -0600, Mark Fasheh wrote:
> > Hi,
> > 
> > The following series of patches implements in btrfs an ioctl to do
> > offline deduplication of file extents.
> 
> Ok I''m relatively happy with this set, I just want to have an
xfstest on deck to
> test it with.  Once I have an xfstest to run to verify its sanity
I''ll pull it
> in.  Thanks,
Awesome thanks Josef. Is it ok to include this patch as well, it implements
the semantics that Gabriel and I talked about. The patch is also on top of
the same git tree I have referenced in the original e-mails.

Btw, let me know if you want/need help writing an xfs test for this.
	--Mark

--
Mark Fasheh


From: Mark Fasheh <mfasheh@suse.de>

btrfs-extent-same: allow root to always dedupe

This will allow end users to run dedupe against their readonly snapshots - a
common method for making backups of btrfs subvolumes. Since it is impossible
for root to get a rw file descriptor on such a snapshot, this enables dedupe
from a privileged account to proceed with read access. Non root accounts
still have the same access restrictions - they cannot dedupe without write
access to the file.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9351b69..68931b9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2629,13 +2629,11 @@ static long btrfs_ioctl_file_extent_same(struct file
*file,
 	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
 	char *src_buffer = NULL;
 	char *dst_buffer = NULL;
+	bool is_admin = capable(CAP_SYS_ADMIN);
 
 	if (!(file->f_mode & FMODE_READ))
 		return -EINVAL;
 
-	if (btrfs_root_readonly(BTRFS_I(src)->root))
-		return -EROFS;
-
 	ret = mnt_want_write_file(file);
 	if (ret)
 		return ret;
@@ -2722,7 +2720,7 @@ static long btrfs_ioctl_file_extent_same(struct file
*file,
 			continue;
 		}
 
-		if (!(dst_file->f_mode & FMODE_WRITE)) {
+		if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
 			info->status = -EINVAL;
 			goto next;
 		}
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Jul-15 20:55 UTC

head link

Re: [PATCH 4/4] btrfs: offline dedupe

Josef asked that I check out the offline dedup patches.  I hope that I
found the most recent posting in my archives :).

The first three patches seemed fine.  I have questions about the
read/compare/clone core:
> +		addr = kmap(page);
> +		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
> +		kunmap(page);
I think you need flush_dcache_page() here ''cause there could be
writable
user address mappings of these page cache pages.
> +static void btrfs_double_lock(struct inode *inode1, u64 loff1,
> +			      struct inode *inode2, u64 loff2, u64 len)
> +{
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode1, loff1, len);
> +		lock_extent_range(inode2, loff2, len);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode2, loff2, len);
> +		lock_extent_range(inode1, loff1, len);
> +	}
> +}
I''d do it a little differently to reduce the surface area for future
1/2
copy-paste typos.  (And maybe make it safe for when src == dst?)

	if (inode1 > inode2)
		swap(inode1, inode2)

	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
	lock_extent_range(inode1, loff1, len);
	if (inode1 != inode2) {
		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
		lock_extent_range(inode2, loff2, len);
	}

If lockdep lets us do this, anyway :/.  (Then maybe a similar swap to
match in unlock, I''m not sure.)
> +	if (copy_from_user(&tmp,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   sizeof(tmp))) {
> +		ret = -EFAULT;
> +		goto out_drop_write;
> +	}
> +
> +	args_size = sizeof(tmp) + (tmp.dest_count *
> +			sizeof(struct btrfs_ioctl_same_extent_info));
> +
> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE) {
> +		ret = -E2BIG;
> +		goto out_drop_write;
> +	}
> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args) {
> +		ret = -ENOMEM;
> +		goto out_drop_write;
> +	}
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(args,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   args_size))
> +		goto out;
> +	/* Make sure args didn''t change magically between copies. */
> +	if (memcmp(&tmp, args, sizeof(tmp)))
> +		goto out;
> +
> +	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
> +		goto out;
> +
> +	/* pre-format ''out'' fields to sane default values */
> +	for (i = 0; i < args->dest_count; i++) {
> +		info = &args->info[i];
> +		info->bytes_deduped = 0;
> +		info->status = 0;
> +	}
I''d get rid of all this code by only copying each input argument on to
the stack as it''s needed and by getting rid of the writable output
struct fields.  (more on this later)
> +	/*
> +	 * Limit the total length we will dedupe for each operation. 
> +	 * This is intended to bound the entire ioctl to something sane.
> +	 */
> +	if (len > BTRFS_MAX_DEDUPE_LEN)
> +		len = BTRFS_MAX_DEDUPE_LEN;
[ ... ]
> +	src_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	dst_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	if (!src_buffer || !dst_buffer) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
Why use these allocated intermediate buffers at all?  Regardless of how
large a region you decide to lock and clone at a time, you can get page
refs for both files and use kmap_atomic() to compare without copying. 

And if you don''t have the buffers, why not lock and clone as much as
the user asked for?

I could understand wanting to reduce lock hold times, but you could do
that by doing the initial io waits on shared page cache references and
only using the expensive locks to verify the pages and perform the
clone.  But hopefully that complexity isn''t really needed?
> +		if (S_ISDIR(dst->i_mode)) {
> +			info->status = -EISDIR;
> +			goto next;
> +		}
Does anything stop us from getting here with symlinks?  It looks like
btrfs_symlink() helpfully overrides i_fop with the _file_ operations
which have .ioctl.  I didn''t actually try it.
> +	if (copy_to_user(argp, args, args_size))
> +		ret = -EFAULT;
I don''t think it should be possible to return -EFAULT after all the
work
is done just because some input memory was accidentally read-only for
whatever reason.
> +#define BTRFS_SAME_DATA_DIFFERS	1
> +/* For extent-same ioctl */
> +struct btrfs_ioctl_same_extent_info {
> +	__s64 fd;		/* in - destination file */
> +	__u64 logical_offset;	/* in - start of extent in destination */
> +	__u64 bytes_deduped;	/* out - total # of bytes we were able
> +				 * to dedupe from this file */
> +	/* status of this dedupe operation:
> +	 * 0 if dedup succeeds
> +	 * < 0 for error
> +	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> +	 */
> +	__s32 status;		/* out - see above description */
> +	__u32 reserved;
> +};
As I said, I''d get rid of the output fields.  Like the other vectored
io
syscalls, the return value can indicate the number of initial
consecutive bytes that worked.  When no progess was made then it can
return errors.  Userspace is left to sort out the resulting state and
figure out the extents to retry in exactly the same way that it found
the initial extents to attempt to dedupe in the first place.

(And imagine strace trying to print the inputs and outputs.  Poor, poor,
strace! :))

I hope this helps!

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gabriel de Perthuis

2013-Jul-17 00:14 UTC

head link

Re: [PATCH 4/4] btrfs: offline dedupe

On Mon, 15 Jul 2013 13:55:51 -0700, Zach Brown wrote:> I''d get rid of all this code by only copying each input argument
on to
> the stack as it''s needed and by getting rid of the writable output
> struct fields.  (more on this later)
> As I said, I''d get rid of the output fields.  Like the other
vectored io
> syscalls, the return value can indicate the number of initial
> consecutive bytes that worked.  When no progess was made then it can
> return errors.  Userspace is left to sort out the resulting state and
> figure out the extents to retry in exactly the same way that it found
> the initial extents to attempt to dedupe in the first place.
> 
> (And imagine strace trying to print the inputs and outputs.  Poor, poor,
> strace! :))
The dedup branch that uses this syscall[1] doesn''t compare files
before submitting them anymore (the kernel will do it, and ranges
may not fit in cache once I get rid of an unnecessary loop).

I don''t have strong opinions on the return style, but it would be
good to have the syscall always make progress by finding at least
one good range before bailing out, and signaling which files were
involved.  With those constraints, the current struct seems like the
cleanest way to pass the data.  The early return you suggest is a
good idea if Mark agrees, but the return condition should be
something like: if one range with bytes_deduped != 0 doesn''t get
bytes_deduped incremented by iteration_len in this iteration, bail
out.
That''s sufficient to guarantee progress and to know which ranges
were involved.
> I hope this helps!
> 
> - z
Thank you and everyone involved for the progress on this.

[1] https://github.com/g2p/bedup/tree/wip/dedup-syscall


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Jun 2013 - [PATCH 0/4] btrfs: offline dedupe v2

[PATCH 0/4] btrfs: offline dedupe v2

[PATCH 1/4] btrfs: abtract out range locking in clone ioctl()

[PATCH 2/4] btrfs_ioctl_clone: Move clone code into it''s own function

[PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

[PATCH 4/4] btrfs: offline dedupe

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 0/4] btrfs: offline dedupe v2

Re: [PATCH 4/4] btrfs: offline dedupe

Re: [PATCH 4/4] btrfs: offline dedupe