thr3ads.net - Btrfs devel - Offline Deduplication for Btrfs [Jan 2011]

If this information is useful, please help other people find it:
Share via:

Josef Bacik

2011-Jan-05 16:36 UTC

Offline Deduplication for Btrfs

Here are patches to do offline deduplication for Btrfs.  It works well for the
cases it''s expected to, I''m looking for feedback on the ioctl
interface and
such, I''m well aware there are missing features for the userspace app
(like
being able to set a different blocksize).  If this interface is acceptable I
will flesh out the userspace app a little more, but I believe the kernel side is
ready to go.

Basically I think online dedup is huge waste of time and completely useless.
You are going to want to do different things with different data.  For example,
for a mailserver you are going to want to have very small blocksizes, but for
say a virtualization image store you are going to want much larger blocksizes.
And lets not get into heterogeneous environments, those just get much too
complicated.  So my solution is batched dedup, where a user just runs this
command and it dedups everything at this point.  This avoids the very costly
overhead of having to hash and lookup for duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.

For the userspace app it only does 64k blocks, or whatever the largest area it
can read out of a file.  I''m going to extend this to do the following
things in
the near future

1) Take the blocksize as an argument so we can have bigger/smaller blocks
2) Have an option to _only_ honor the blocksize, don''t try and dedup
smaller
blocks
3) Use fiemap to try and dedup extents as a whole and just ignore specific
blocksizes
4) Use fiemap to determine what would be the most optimal blocksize for the data
you want to dedup.

I''ve tested this out on my setup and it seems to work well.  I
appreciate any
feedback you may have.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-05 16:36 UTC

head link

[PATCH] Btrfs: add extent-same ioctl for dedup

This adds the ability for userspace to tell btrfs which extents match eachother.
You pass in

-a logical offset
-a length
-a hash type (currently only sha256 is supported)
-the hash
-a list of file descriptors with their logical offset

and this ioctl will split up the extent on the target file and then link all of
the files with the target files extent and free up their original extent.  The
hash is to make sure nothing has changed between the userspace app running and
we doing the actual linking, so we hash everything to make sure it''s
all still
the same.  This doesn''t work in a few key cases

1) Any data transformation whatsoever.  This includes compression or any
encryption that happens later on.  This is just to make sure we''re not
deduping
things that don''t turn out to be the same stuff on disk as it is
uncompressed/decrypted.

2) Across subvolumes.  This can be fixed later, but this is just to keep odd
problems from happening, like oh say trying to dedup things that are snapshots
of eachother already.  Nothing bad will happen, it''s just needless work
so just
don''t allow it for the time being.

3) If the target file''s data is split across extents.  We need one
extent to
point everybody at, so if the target file''s data spans different
extents we
won''t work.  In this case I return ERANGE so the userspace app can call
defrag
and then try again, but currently I don''t do that, so that will have to
be fixed
at some point.

I think thats all of the special cases.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/btrfs/ioctl.c |  662 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   19 ++
 2 files changed, 681 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f1c9bb4..15e1c63 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -40,6 +40,8 @@
 #include <linux/xattr.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -2231,6 +2233,664 @@ static noinline long btrfs_ioctl_wait_sync(struct file
*file, void __user *argp)
 	return btrfs_wait_for_commit(root, transid);
 }
 
+static noinline int check_hash(struct inode *inode,
+			       struct btrfs_ioctl_same_args *args,
+			       u64 off, u64 len)
+{
+	struct hash_desc desc;
+	struct scatterlist sg;
+	struct page *page;
+	void *addr;
+	char *buffer;
+	char *digest = NULL;
+	char *hash = NULL;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	int digest_len;
+
+	buffer = kmalloc(len, GFP_NOFS);
+	if (!buffer)
+		return -ENOMEM;
+
+	switch (args->hash_type) {
+	case BTRFS_SAME_EXTENT_HASH_SHA256:
+		desc.tfm = crypto_alloc_hash("sha256", 0, CRYPTO_ALG_ASYNC);
+		break;
+	default:
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!desc.tfm) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	digest_len = crypto_hash_digestsize(desc.tfm);
+	if (digest_len != args->hash_len) {
+		ret = -EINVAL;
+		goto out_crypto;
+	}
+
+	hash = kmalloc(digest_len, GFP_NOFS);
+	if (!hash) {
+		ret = -ENOMEM;
+		goto out_crypto;
+	}
+
+	digest = kmalloc(digest_len, GFP_NOFS);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto out_crypto;
+	}
+
+	if (copy_from_user(hash, (char __user *)args->hash, digest_len)) {
+		ret = -EFAULT;
+		goto out_crypto;
+	}
+
+	desc.flags = 0;
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out_crypto;
+		}
+
+		if (!PageUptodate(page)) {
+			btrfs_readpage(NULL, page);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out_crypto;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+	sg_init_one(&sg, buffer, len);
+
+	if (crypto_hash_digest(&desc, &sg, sg.length, digest)) {
+		ret = -EINVAL;
+		goto out_crypto;
+	}
+
+	if (memcmp(digest, hash, digest_len)) {
+		printk(KERN_ERR "hash''s don''t match\n");
+		ret = -EINVAL;
+	}
+
+out_crypto:
+	kfree(digest);
+	kfree(hash);
+	crypto_free_hash(desc.tfm);
+out:
+	kfree(buffer);
+	printk(KERN_ERR "mark 4\n");
+	return ret;
+}
+
+static noinline int split_extent(struct btrfs_trans_handle *trans,
+				 struct inode *inode, u64 start, u64 end,
+				 u64 *bytenr, u64 *bytes, u64 *offset)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_end;
+	u64 extent_offset;
+	u64 search_start = start;
+	u64 disk_bytenr;
+	u64 disk_bytes;
+	int ret;
+	int recow;
+	int extent_type;
+
+	btrfs_drop_extent_cache(inode, start, end - 1, 0);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		recow = 0;
+		ret = btrfs_lookup_file_extent(trans, root, path,
+					       inode->i_ino, search_start, 1);
+		if (ret < 0)
+			break;
+		if (ret > 0 && path->slots[0] > 0) {
+			path->slots[0]--;
+		} else if (ret > 0 && path->slots[0] == 0) {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+
+next_slot:
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid > inode->i_ino ||
+		    key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
+			break;
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		if (extent_type != BTRFS_FILE_EXTENT_REG) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (btrfs_file_extent_compression(leaf, fi) ||
+		    btrfs_file_extent_encryption(leaf, fi) ||
+		    btrfs_file_extent_other_encoding(leaf, fi)) {
+			printk(KERN_ERR "cannot dedup transformed extents\n");
+			ret = -EINVAL;
+			break;
+		}
+
+		extent_start = key.offset;
+		extent_end = extent_start +
+			btrfs_file_extent_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		disk_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+
+		if (extent_end < search_start) {
+			path->slots[0]++;
+			goto next_slot;
+		}
+
+		search_start = max(key.offset, start);
+		if (recow) {
+			btrfs_release_path(root, path);
+			continue;
+		}
+
+		/*
+		 *     | - range to split - |
+		 *  | -------- extent -------- |
+		 */
+		if (start > key.offset && end < extent_end) {
+			struct btrfs_key new_key;
+
+			memcpy(&new_key, &key, sizeof(new_key));
+			new_key.offset = start;
+			ret = btrfs_duplicate_item(trans, root, path,
+						   &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi, start -
+							key.offset);
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += start - key.offset;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - start);
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+
+			btrfs_mark_buffer_dirty(leaf);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino,
+						   extent_offset);
+			if (ret) {
+				int err;
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+					extent_end - extent_start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+
+			new_key.offset = end;
+			ret = btrfs_duplicate_item(trans, root, path, &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							end - start);
+
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += end - start;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - end);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino, 0);
+			if (ret) {
+				int err;
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+						extent_end - start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+			btrfs_mark_buffer_dirty(leaf);
+			ret = 0;
+			break;
+		}
+
+		/*
+		 * | - range to split -|
+		 * | ------ extent ------|
+		 */
+		if (start == key.offset && end < extent_end) {
+			struct btrfs_key new_key;
+
+			memcpy(&new_key, &key, sizeof(new_key));
+			new_key.offset = end;
+			ret = btrfs_duplicate_item(trans, root, path,
+						   &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							end - start);
+
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += end - start;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - end);
+			btrfs_mark_buffer_dirty(leaf);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino,
+						   extent_offset);
+			if (ret) {
+				int err;
+
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+						extent_end - start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+			ret = 0;
+			break;
+		}
+
+		if (start == key.offset && end == extent_end) {
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+			ret = 0;
+			break;
+		}
+
+		ret = -ERANGE;
+		break;
+	}
+
+	btrfs_free_path(path);
+	return ret;
+}
+
+static noinline int link_extent(struct btrfs_trans_handle *trans,
+				struct inode *inode, u64 disk_bytenr,
+				u64 disk_bytes, u64 extent_offset,
+				u64 offset, u64 len)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_key key;
+	u64 hint;
+	int ret;
+	int err;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	ret = btrfs_inc_extent_ref(trans, root, disk_bytenr, disk_bytes, 0,
+				   root->root_key.objectid, inode->i_ino,
+				   extent_offset);
+	if (ret)
+		return ret;
+
+	ret = btrfs_drop_extents(trans, inode, offset, offset + len,
+				 &hint, 1);
+	if (ret) {
+		err = ret;
+		btrfs_free_path(path);
+		goto out;
+	}
+
+	key.objectid = inode->i_ino;
+	key.offset = offset;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*fi));
+
+	/*
+	 * Yes I know, it sucks, but if we don''t do this we''ll lose
data, we
+	 * need to come up with a way to undo the drop_extents so that if this
+	 * part fails we can still at least have the old data.
+	 */
+	BUG_ON(ret);
+
+	leaf = path->nodes[0];
+	fi = btrfs_item_ptr(leaf, path->slots[0],
+			    struct btrfs_file_extent_item);
+
+	btrfs_set_file_extent_generation(leaf, fi, trans->transid);
+	btrfs_set_file_extent_type(leaf, fi, BTRFS_FILE_EXTENT_REG);
+	btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
+	btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_bytes);
+	btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+	btrfs_set_file_extent_num_bytes(leaf, fi, len);
+	btrfs_set_file_extent_ram_bytes(leaf, fi, len);
+	btrfs_set_file_extent_compression(leaf, fi, 0);
+	btrfs_set_file_extent_encryption(leaf, fi, 0);
+	btrfs_set_file_extent_other_encoding(leaf, fi, 0);
+
+	btrfs_mark_buffer_dirty(leaf);
+	inode_add_bytes(inode, len);
+	btrfs_free_path(path);
+
+	return 0;
+out:
+	ret = btrfs_free_extent(trans, root, disk_bytenr, disk_bytes, 0,
+				root->root_key.objectid, inode->i_ino,
+				extent_offset);
+	if (ret)
+		err = ret;
+	return err;
+}
+
+static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
+{
+	while (1) {
+		struct btrfs_ordered_extent *ordered;
+		lock_extent(&BTRFS_I(inode)->io_tree, off,
+			    off + len - 1, GFP_NOFS);
+		ordered = btrfs_lookup_first_ordered_extent(inode,
+							    off + len - 1);
+		if (!ordered &&
+		    !test_range_bit(&BTRFS_I(inode)->io_tree, off,
+				    off + len - 1, EXTENT_DELALLOC, 0, NULL))
+			break;
+		unlock_extent(&BTRFS_I(inode)->io_tree, off, off + len -1,
+			      GFP_NOFS);
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		btrfs_wait_ordered_range(inode, off, len);
+	}
+}
+
+static noinline long btrfs_ioctl_file_extent_same(struct file *file,
+						  void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct inode *src = file->f_dentry->d_inode;
+	struct btrfs_root *root;
+	struct file *dst_file;
+	struct inode *dst_inode;
+	struct btrfs_trans_handle *trans;
+	u64 off;
+	u64 len;
+	u64 bytenr;
+	u64 bytes;
+	u64 extent_offset;
+	int args_size = 0;
+	int drop_ref = 0;
+	int i;
+	int ret;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp)))
+		return -EFAULT;
+
+	if (tmp.hash_type != BTRFS_SAME_EXTENT_HASH_SHA256)
+		return -EINVAL;
+
+	args_size = sizeof(tmp) + (tmp.total_files *
+			sizeof(struct btrfs_ioctl_file_extent_info));
+
+	/* lets not let this get out of hand */
+	if (args_size > PAGE_CACHE_SIZE)
+		return -ENOMEM;
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args)
+		return -ENOMEM;
+
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		return -EFAULT;
+
+	/* Make sure args didn''t change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp))) {
+		kfree(args);
+		return -EINVAL;
+	}
+
+	if (S_ISDIR(src->i_mode)) {
+		kfree(args);
+		return -EISDIR;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+	root = BTRFS_I(src)->root;
+
+	/*
+	 * Since we have to hash the extent to make sure it''s still right, we
+	 * have to allocate a buffer to hold the data, so let''s not let that
+	 * buffer be larger than 1mb just to make sure nobody abuses this.
+	 */
+	if (len > 1 * 1024 * 1024)
+		return -ENOMEM;
+
+	lock_extent_range(src, off, len);
+
+	if (check_hash(src, args, off, len)) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		kfree(args);
+		return -EINVAL;
+	}
+
+	trans = btrfs_start_transaction(root, args->total_files + 1);
+	if (IS_ERR(trans)) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		kfree(args);
+		return PTR_ERR(trans);
+	}
+
+	ret = split_extent(trans, src, off, off + len, &bytenr, &bytes,
+			   &extent_offset);
+	if (ret) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		btrfs_end_transaction(trans, root);
+		kfree(args);
+		return ret;
+	}
+
+	ret = btrfs_inc_extent_ref(trans, root, bytenr, bytes, 0,
+				   root->root_key.objectid, src->i_ino,
+				   extent_offset);
+	unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1, GFP_NOFS);
+	if (ret) {
+		btrfs_end_transaction(trans, root);
+		kfree(args);
+		return ret;
+	}
+
+	drop_ref = 1;
+	for (i = 0; i < args->total_files; i++) {
+		struct btrfs_ioctl_file_extent_info *info;
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+		dst_inode = dst_file->f_dentry->d_inode;
+		if (S_ISDIR(dst_inode->i_mode)) {
+			fput(dst_file);
+			printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+
+		if (BTRFS_I(dst_inode)->root != root) {
+			fput(dst_file);
+			printk(KERN_ERR "btrfs: cannot dedup across subvolumes"
+			      " %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+
+		off = info->logical_offset;
+		lock_extent_range(dst_inode, off, len);
+		if (check_hash(dst_inode, args, off, len)) {
+			printk(KERN_ERR "btrfs: hash for fd %lld does not "
+			       "match\n", info->fd);
+			unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+				      off + len - 1, GFP_NOFS);
+			fput(dst_file);
+			continue;
+		}
+
+		ret = link_extent(trans, dst_inode, bytenr, bytes,
+				  extent_offset, off, len);
+		if (ret) {
+			unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+				      off + len - 1, GFP_NOFS);
+			fput(dst_file);
+			break;
+		}
+
+		unlock_extent(&BTRFS_I(dst_inode)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+
+		fput(dst_file);
+
+		if (drop_ref) {
+			ret = btrfs_free_extent(trans, root, bytenr, bytes, 0,
+						root->root_key.objectid,
+						src->i_ino, extent_offset);
+			if (ret)
+				break;
+			drop_ref = 0;
+		}
+	}
+
+	if (drop_ref) {
+		btrfs_free_extent(trans, root, bytenr, bytes, 0,
+				  root->root_key.objectid, src->i_ino,
+				  extent_offset);
+	}
+
+	btrfs_end_transaction(trans, root);
+
+	kfree(args);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2287,6 +2947,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_start_sync(file, argp);
 	case BTRFS_IOC_WAIT_SYNC:
 		return btrfs_ioctl_wait_sync(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 17c99eb..ca9b034 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -145,6 +145,23 @@ struct btrfs_ioctl_space_args {
 	struct btrfs_ioctl_space_info spaces[0];
 };
 
+#define BTRFS_SAME_EXTENT_HASH_SHA256	1
+
+struct btrfs_ioctl_file_extent_info {
+	__s64 fd;
+	__u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;
+	__u64 length;
+	__u64 total_files;
+	u32 hash_len;
+	u8 hash_type;
+	char *hash;
+	struct btrfs_ioctl_file_extent_info info[0];
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -189,4 +206,6 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
 				   struct btrfs_ioctl_async_vol_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+					struct btrfs_ioctl_same_args)
 #endif
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-05 16:36 UTC

head link

[PATCH] Btrfs-progs: add dedup functionality

This program very basically does dedup.  It searches a directory recursively and
scans all of the files looking for 64k extents and hashing them to figure out if
there are any duplicates.  After that it calls the btrfs same extent ioctl for
all of the duplicates in order to dedup the space on disk.  There is alot more
that can be done with this, like changing the block size and so on, but for now
this will get us started.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 Makefile |    8 +-
 dedup.c  |  658 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ioctl.h  |   19 ++
 3 files changed, 683 insertions(+), 2 deletions(-)
 create mode 100644 dedup.c

diff --git a/Makefile b/Makefile
index 525676e..b9a51c4 100644
--- a/Makefile
+++ b/Makefile
@@ -15,10 +15,11 @@ INSTALL= install
 prefix ?= /usr/local
 bindir = $(prefix)/bin
 LIBS=-luuid
+GCRYPT_LIBS=`libgcrypt-config --libs`
+GCRYPT_CFLAGS=`libgcrypt-config --cflags`
 
 progs = btrfsctl mkfs.btrfs btrfs-debug-tree btrfs-show btrfs-vol btrfsck \
-	btrfs \
-	btrfs-map-logical
+	btrfs btrfs-map-logical btrfs-dedup
 
 # make C=1 to enable sparse
 ifdef C
@@ -50,6 +51,9 @@ btrfs-vol: $(objects) btrfs-vol.o
 btrfs-show: $(objects) btrfs-show.o
 	gcc $(CFLAGS) -o btrfs-show btrfs-show.o $(objects) $(LDFLAGS) $(LIBS)
 
+btrfs-dedup: $(objects) dedup.o
+	gcc $(CFLAGS) -o btrfs-dedup dedup.c $(objects) $(LDFLAGS) $(GCRYPT_LIBS)
$(GCRYPT_CFLAGS) $(LIBS)
+
 btrfsck: $(objects) btrfsck.o
 	gcc $(CFLAGS) -o btrfsck btrfsck.o $(objects) $(LDFLAGS) $(LIBS)
 
diff --git a/dedup.c b/dedup.c
new file mode 100644
index 0000000..5385e28
--- /dev/null
+++ b/dedup.c
@@ -0,0 +1,658 @@
+/*
+ * Copyright (C) 2011 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#define _GNU_SOURCE 1
+#define X_OPEN_SOURCE 500
+
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <gcrypt.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include "rbtree.h"
+#include "ioctl.h"
+
+struct file_entry {
+	ino_t ino;
+	unsigned long pos;
+	unsigned long len;
+	unsigned long physical;
+	char *hash;
+	int entries;
+	int fd;
+	struct file_entry *next;
+	struct rb_node n;
+};
+
+struct file {
+	char *name;
+	ino_t ino;
+	struct rb_node n;
+};
+
+static unsigned int digest_len;
+static char *digest;
+static struct rb_root hash_tree;
+static struct rb_root file_tree;
+unsigned int buffer_len = 65536;
+
+static void print_hash(uint64_t *hash)
+{
+	int i;
+
+	for (i = 0; i < 8; i++) {
+		printf("%llx", (unsigned long long)hash[i]);
+	}
+}
+
+static char *hash_buffer(void *buffer, int len)
+{
+	gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buffer, len);
+
+	return strndup(digest, digest_len);
+}
+
+static struct rb_node *hash_tree_insert(struct file_entry *fe)
+{
+	struct rb_node **p = &hash_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct file_entry *entry;
+
+	while (*p) {
+		int cmp;
+
+		parent = *p;
+		entry = rb_entry(parent, struct file_entry, n);
+
+		cmp = memcmp(fe->hash, entry->hash, digest_len);
+
+		if (cmp < 0)
+			p = &(*p)->rb_left;
+		else if (cmp > 0)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node(&fe->n, parent, p);
+	rb_insert_color(&fe->n, &hash_tree);
+
+	return NULL;
+}
+
+#if 0
+static unsigned long logical_to_physical(int fd, unsigned long logical,
+					 unsigned long len)
+{
+	struct fiemap *map;
+	struct fiemap_extent *extent;
+	unsigned long physical = 0;
+	int size = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
+	int ret;
+
+	map = malloc(sizeof(struct fiemap) + sizeof(struct fiemap_extent));
+	if (!map)
+		return 0;
+
+	memset(map, 0, size);
+	map->fm_start = logical;
+	map->fm_length = len;
+	map->fm_flags = FIEMAP_FLAG_SYNC;
+	map->fm_extent_count = 1;
+	ret = ioctl(fd, FS_IOC_FIEMAP, map);
+	if (ret) {
+		fprintf(stderr, "failed to do fiemap %d\n", errno);
+		return 0;
+	}
+	extent = &map->fm_extents[0];
+
+	physical = extent->fe_physical;
+	if (extent->fe_logical < logical)
+		physical +=  (logical - extent->fe_logical);
+	free(map);
+
+	return physical;
+}
+#endif
+
+static struct rb_node *file_tree_insert(struct file *file)
+{
+	struct rb_node **p = &file_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct file *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct file, n);
+
+		if (file->ino < entry->ino)
+			p = &(*p)->rb_left;
+		else if (file->ino > entry->ino)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node(&file->n, parent, p);
+	rb_insert_color(&file->n, &file_tree);
+
+	return NULL;
+}
+
+static struct file *file_tree_search(ino_t ino)
+{
+	struct rb_node *node = file_tree.rb_node;
+	struct file *entry;
+
+	while (node) {
+		entry = rb_entry(node, struct file, n);
+		if (ino < entry->ino)
+			node = node->rb_left;
+		else if (ino > entry->ino)
+			node = node->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+static int hash_file(char *filename)
+{
+	char *buffer;
+	struct rb_node *node;
+	struct file *file;
+	int fd;
+	ssize_t bytes;
+	size_t pos = 0;
+	struct stat st;
+	ino_t ino;
+
+	buffer = malloc(buffer_len);
+	if (!buffer) {
+		fprintf(stderr, "failed to alloc buffer\n");
+		return 1;
+	}
+
+	file = malloc(sizeof(*file));
+	if (!file) {
+		free(buffer);
+		fprintf(stderr, "failed to alloc file thing\n");
+		return 1;
+	}
+
+	fd = open(filename, O_RDONLY);
+	if (fd < 0) {
+		free(buffer);
+		return 1;
+	}
+
+	if (fstat(fd, &st) < 0) {
+		fprintf(stderr, "failed to stat\n");
+		free(buffer);
+		return 1;
+	}
+
+	ino = st.st_ino;
+	file->ino = ino;
+
+	node = file_tree_insert(file);
+	if (node) {
+		printf("wtf?\n");
+		free(file);
+		free(buffer);
+		close(fd);
+		return 0;
+	}
+
+	file->name = strdup(filename);
+
+	while ((bytes = read(fd, buffer, buffer_len)) > 0) {
+		struct file_entry *entry = malloc(sizeof(struct file_entry));
+
+		if (!entry) {
+			fprintf(stderr, "failed to allocate entry\n");
+			bytes = -1;
+			break;
+		}
+
+		entry->ino = ino;
+		entry->pos = pos;
+		entry->len = bytes;
+		entry->hash = hash_buffer(buffer, bytes);
+		if (!entry->hash) {
+			fprintf(stderr, "failed to allocate hash\n");
+			bytes = -1;
+			break;
+		}
+		entry->next = NULL;
+		entry->entries = 0;
+		node = hash_tree_insert(entry);
+		if (node) {
+			struct file_entry *existing;
+
+			existing = rb_entry(node, struct file_entry, n);
+			free(entry->hash);
+			entry->hash = NULL;
+			entry->next = existing->next;
+			existing->next = entry;
+			existing->entries++;
+		} else {
+//			entry->physical = logical_to_physical(fd, pos, bytes);
+		}
+		pos += bytes;
+	}
+
+	close(fd);
+	free(buffer);
+
+	if (bytes < 0)
+		printf("Bytes negative, error %d\n", errno);
+	return (bytes < 0);
+}
+
+static void print_stats()
+{
+	struct rb_node *node;
+	int total_extents = 0;
+	int total_duplicates = 0;
+
+	for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		total_extents++;
+		if (entry->entries)
+			total_duplicates += entry->entries;
+	}
+
+	printf("Total extents hashed:\t%d\n", total_extents);
+	printf("Total duplicates:\t%d\n", total_duplicates);
+	printf("Total space saved:\t%d\n", total_duplicates * buffer_len);
+
+}
+
+static void free_hash_tree()
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&hash_tree))) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		rb_erase(node, &hash_tree);
+		do {
+			struct file_entry *temp;
+
+			if (entry->next == NULL)
+				break;
+			temp = entry->next;
+			free(entry->hash);
+			free(entry);
+			entry = temp;
+		} while (1);
+
+		free(entry->hash);
+		free(entry);
+	}
+}
+
+static void free_file_tree()
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&file_tree))) {
+		struct file *entry;
+		entry = rb_entry(node, struct file, n);
+		rb_erase(node, &file_tree);
+//		printf("freeing %s, ino %lu\n", entry->name, entry->ino);
+		free(entry->name);
+		free(entry);
+	}
+}
+
+static void verify_match(struct file_entry *fe)
+{
+	struct file_entry *orig = fe;
+	struct file *file;
+	char *buffer;
+	char *temp;
+	ssize_t bytes;
+	int fd;
+
+	buffer = malloc(buffer_len);
+	if (!buffer)
+		return;
+
+	temp = malloc(buffer_len);
+	if (!temp) {
+		free(buffer);
+		return;
+	}
+
+	file = file_tree_search(fe->ino);
+	if (!file) {
+		fprintf(stderr, "couldn''t find file, weird %lu\n",
fe->ino);
+		goto out;
+	}
+
+	fd = open(file->name, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "failed to open%s\n", file->name);
+		goto out;
+	}
+
+	bytes = pread(fd, buffer, fe->len, fe->pos);
+	close(fd);
+	if (bytes < fe->len) {
+		fprintf(stderr, "short read\n");
+		goto out;
+	}
+
+	while (fe->next != NULL) {
+		struct file_entry *prev = fe;
+
+		fe = fe->next;
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldn''t find file, weird\n");
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		fd = open(file->name, O_RDONLY);
+		if (fd < 0) {
+			fprintf(stderr, "couldn''t open %s\n", file->name);
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		bytes = pread(fd, temp, fe->len, fe->pos);
+		close(fd);
+		if (bytes < fe->len) {
+			fprintf(stderr, "short read\n");
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		if (memcmp(buffer, temp, fe->len)) {
+			fprintf(stderr, "manual compare doesn''t match: %s\n",
+				file->name);
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+	}
+
+	if (!orig->entries) {
+		rb_erase(&orig->n, &hash_tree);
+		free(orig->hash);
+		free(orig);
+	}
+out:
+	free(buffer);
+	free(temp);
+}
+
+static void prune_entries()
+{
+	struct rb_node *node;
+
+	node = rb_first(&hash_tree);
+	while (node) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		if (entry->entries) {
+			node = rb_next(node);
+			verify_match(entry);
+			continue;
+		}
+
+		node = rb_next(node);
+		rb_erase(&entry->n, &hash_tree);
+		free(entry->hash);
+		free(entry);
+	}
+}
+
+static void print_hashes()
+{
+	struct rb_node *hash_node;
+
+	for (hash_node = rb_first(&hash_tree); hash_node;
+	     hash_node = rb_next(hash_node)) {
+		struct file_entry *fe;
+		struct file *file;
+
+		fe = rb_entry(hash_node, struct file_entry, n);
+
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+			break;
+		}
+
+		print_hash((uint64_t *)fe->hash);
+		printf("\n");
+		printf("\t%s: pos=%lu, phys=%lu, len=%lu\n", file->name,
+		       fe->pos, fe->physical, fe->len);
+
+		while (fe->next != NULL) {
+			fe = fe->next;
+			file = file_tree_search(fe->ino);
+			if (!file) {
+				fprintf(stderr, "couldnt find file, weird\n");
+				continue;
+			}
+
+			printf("\t%s: pos=%lu, len=%lu\n", file->name,
+			       fe->pos, fe->len);
+		}
+	}
+}
+
+
+static void do_dedup()
+{
+	struct rb_node *node;
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_file_extent_info *info;
+	size_t size;
+	int allocated_entries = 1;
+
+	size = sizeof(*args) + sizeof(*info);
+	args = malloc(size);
+	if (!args) {
+		fprintf(stderr, "error allocating ioctl arguments\n");
+		return;
+	}
+
+	memset(args, 0, size);
+
+	for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+		struct file_entry *fe, *orig;
+		struct file *file;
+		int fd;
+		int ret;
+
+		orig = fe = rb_entry(node, struct file_entry, n);
+
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+			break;
+		}
+
+		fd = open(file->name, O_WRONLY);
+		if (fd < 0) {
+			fprintf(stderr, "error opening file (%s) for dedup: %s\n",
+				file->name, strerror(errno));
+			continue;
+		}
+
+		if (fe->entries > allocated_entries) {
+			allocated_entries = fe->entries;
+			size = sizeof(*args) +
+				(sizeof(*info) * allocated_entries);
+			args = realloc(args, size);
+			if (!args) {
+				fprintf(stderr, "error allocating ioctl "
+					"arguments\n");
+				return;
+			}
+			memset(args, 0, size);
+		}
+
+		args->total_files = 0;
+		args->logical_offset = fe->pos;
+		args->length = fe->len;
+		args->hash_len = digest_len;
+		args->hash_type = BTRFS_SAME_EXTENT_HASH_SHA256;
+		args->hash = fe->hash;
+
+		info = &args->info[0];
+		while (fe->next != NULL) {
+			fe = fe->next;
+			file = file_tree_search(fe->ino);
+			if (!file) {
+				fprintf(stderr, "couldnt find file, weird\n");
+				continue;
+			}
+
+			fe->fd = info->fd = open(file->name, O_WRONLY);
+			if (info->fd < 0) {
+				fprintf(stderr, "error opening file (%s) for "
+					"dedup: %s\n", file->name,
+					strerror(errno));
+				continue;
+			}
+			info->logical_offset = fe->pos;
+			info++;
+			args->total_files++;
+		}
+
+		if (args->total_files == 0) {
+			close(fd);
+			continue;
+		}
+
+		ret = ioctl(fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
+		if (ret)
+			fprintf(stderr, "failed to do extent same %d\n",
+				errno);
+
+		fe = orig;
+		while (fe->next != NULL) {
+			fe = fe->next;
+			if (fe->fd >= 0)
+				close(fe->fd);
+		}
+		close(fd);
+	}
+}
+
+static void scan_dir(char *dirname)
+{
+	DIR *dir;
+	struct dirent *dirent;
+
+	dir = opendir(dirname);
+	if (!dir) {
+		fprintf(stderr, "failed to open dir %s: %d\n", dirname, errno);
+		return;
+	}
+
+	while (((dirent = readdir(dir)) != NULL)) {
+		char name[PATH_MAX];
+		if (dirent->d_type == DT_DIR) {
+			if (dirent->d_name[0] == ''.'')
+				continue;
+			snprintf(name, PATH_MAX, "%s/%s", dirname,
+				 dirent->d_name);
+			scan_dir(name);
+		} else if (dirent->d_type == DT_REG) {
+			snprintf(name, PATH_MAX, "%s/%s", dirname,
+				 dirent->d_name);
+			hash_file(name);
+		}
+	}
+
+	closedir(dir);
+}
+
+int main(int argc, char **argv)
+{
+	if (argc < 2) {
+		fprintf(stderr, "dedup <directory>\n");
+		exit(1);
+	}
+
+	if (!gcry_check_version(GCRYPT_VERSION)) {
+		fprintf(stderr, "libgcrypt version mismatch\n");
+		exit(1);
+	}
+
+	gcry_control(GCRYCTL_DISABLE_SECMEM, 0);
+	gcry_control(GCRYCTL_INITIALIZATION_FINISHED, 0);
+
+	digest_len = gcry_md_get_algo_dlen(GCRY_MD_SHA256);
+	digest = malloc(digest_len);
+	if (!digest) {
+		fprintf(stderr, "failed to alloc digest\n");
+		exit(1);
+	}
+
+	hash_tree = RB_ROOT;
+	file_tree = RB_ROOT;
+
+	scan_dir(argv[1]);
+
+	print_stats();
+
+	prune_entries();
+
+	print_hashes();
+
+	do_dedup();
+
+	free_hash_tree();
+	free_file_tree();
+	free(digest);
+
+	return 0;
+}
diff --git a/ioctl.h b/ioctl.h
index 4ace64f..646837b 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -138,6 +138,23 @@ struct btrfs_ioctl_disk_info_args {
 	__u64 devices[0];
 };
 
+#define BTRFS_SAME_EXTENT_HASH_SHA256	1
+
+struct btrfs_ioctl_file_extent_info {
+	__s64 fd;
+	__u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;
+	__u64 length;
+	__u64 total_files;
+	__u32 hash_len;
+	__u8 hash_type;
+	char *hash;
+	struct btrfs_ioctl_file_extent_info info[0];
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -177,4 +194,6 @@ struct btrfs_ioctl_disk_info_args {
 				    struct btrfs_ioctl_space_args)
 #define BTRFS_IOC_DISK_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \
 				  struct btrfs_ioctl_disk_info_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+					struct btrfs_ioctl_same_args)
 #endif
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-05 17:42 UTC

head link

Re: Offline Deduplication for Btrfs

Josef Bacik wrote:
> Basically I think online dedup is huge waste of time and completely
useless.
I couldn''t disagree more. First, let''s consider what is the 
general-purpose use-case of data deduplication. What are the resource 
requirements to perform it? How do these resource requirements differ 
between online and offline?

The only sane way to keep track of hashes of existing blocks is using an 
index. Searches through an index containing evenly distributed data 
(such as hashes) is pretty fast (log(N)), and this has to be done 
regardless of whether the dedupe is online or offline. It also goes 
without saying that all the blocks being deduplicated need to be hashed, 
and the cost of this is also the same whether the block is hashes online 
or offline.

Let''s look at the relative merits:

1a) Offline
We have to copy the entire data set. This means we are using the full 
amount of disk writes that the data set size dictates. Do we do the 
hashing of current blocks at this point to create the indexes? Or do we 
defer it until some later time?

Doing it at the point of writes is cheaper - we already have the data in 
RAM and we can calculate the hashes as we are writing each block. 
Performance implications of this are fairly analogous to the parity RAID 
RMW performance issue - to achieve decent performance you have to write 
the parity at the same time as the rest of the stripe, otherwise you 
have to read the part of the stripe you didn''t write, before
calculating
the checksum.

So by doing the hash indexing offline, the total amount of disk I/O 
required effectively doubles, and the amount of CPU spent on doing the 
hashing is in no way reduced.

How is this in any way advantageous?

1b) Online
As we are writing the data, we calculate the hashes for each block. (See 
1a for argument of why I believe this is saner and cheaper than doing it 
offline.) Since we already have these hashes, we can do a look-up in the 
hash-index, and either write out the block as is (if that hash isn''t 
already in the index) or simply write the pointer to an existing 
suitable block (if it already exists). This saves us writing out that 
block - fewer writes to the disk, not to mention we don''t later have to
re-read the block to dedupe it.

So in this case, instead of write-read-relink of the offline scenario, 
we simply do relink on duplicate blocks.

There is another reason to favour the online option due to it''s lower 
write stress - SSDs. Why hammer the SSD with totally unnecessary writes?

The _only_ reason to defer deduping is that hashing costs CPU time. But 
the chances are that a modern CPU core can churn out MD5 and/or SHA256 
hashes faster than a modern mechanical disk can keep up. A 15,000rpm 
disk can theoretically handle 250 IOPS. A modern CPU can handle 
considerably more than 250 block hashings per second. You could argue 
that this changes in cases of sequential I/O on big files, but a 1.86GHz 
GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will 
struggle to keep up with.

I don''t think that the realtime performance argument withstands
scrutiny.
> You are going to want to do different things with different data.  For
example,
> for a mailserver you are going to want to have very small blocksizes, but
for
> say a virtualization image store you are going to want much larger
blocksizes.
> And lets not get into heterogeneous environments, those just get much too
> complicated.
In terms of deduplication, IMO it should really all be uniform, 
transparent, and block based. In terms of specifying which subtrees to 
dedupe, that should really be a per subdirectory hereditary attribute, 
kind of like compression was supposed to work with chattr +c in the past.
> So my solution is batched dedup, where a user just runs this
> command and it dedups everything at this point.  This avoids the very
costly
> overhead of having to hash and lookup for duplicate extents online and lets
us
> be _much_ more flexible about what we want to deduplicate and how we want
to do
> it.
I don''t see that it adds any flexibility compared to the hereditary 
deduping attribute. I also don''t see that it is any cheaper.
It''s
actually more expensive, according to the reasoning above.

As an aside, zfs and lessfs both do online deduping, presumably for a 
good reason.

Then again, for a lot of use-cases there are perhaps better ways to 
achieve the targed goal than deduping on FS level, e.g. snapshotting or 
something like fl-cow:
http://www.xmailserver.org/flcow.html

Personally, I would still like to see a fl-cow like solution that 
actually preserves the inode numbers of duplicate files while providing 
COW functionality that breaks this unity (and inode number identity) 
upon writes, specifically because it saves page cache (only have to 
cache one copy) and in case of DLLs on chroot style virtualization 
(OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are 
all mapped into the same memory, thus yielding massive memory savings on 
machines with a lot of VMs.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Simon Farnsworth

2011-Jan-05 17:50 UTC

head link

Re: [PATCH] Btrfs: add extent-same ioctl for dedup

Josef Bacik wrote:
> This adds the ability for userspace to tell btrfs which extents match
> eachother. You pass in
> 
> -a logical offset
> -a length
> -a hash type (currently only sha256 is supported)
> -the hash
> -a list of file descriptors with their logical offset
> 
> and this ioctl will split up the extent on the target file and then link
> all of
> the files with the target files extent and free up their original extent. 
> The hash is to make sure nothing has changed between the userspace app
> running and we doing the actual linking, so we hash everything to make
> sure it''s all still
> the same.  This doesn''t work in a few key cases
> 
> 1) Any data transformation whatsoever.  This includes compression or any
> encryption that happens later on.  This is just to make sure we''re
not
> deduping things that don''t turn out to be the same stuff on disk
as it is
> uncompressed/decrypted.
> 
> 2) Across subvolumes.  This can be fixed later, but this is just to keep
> odd problems from happening, like oh say trying to dedup things that are
> snapshots
> of eachother already.  Nothing bad will happen, it''s just needless
work so
> just don''t allow it for the time being.
> 
> 3) If the target file''s data is split across extents.  We need one
extent
> to point everybody at, so if the target file''s data spans
different
> extents we
> won''t work.  In this case I return ERANGE so the userspace app can
call
> defrag and then try again, but currently I don''t do that, so that
will
> have to be fixed at some point.
> 
> I think thats all of the special cases.  Thanks,
> I''m going to ask the stupid question: What happens if an attacker user
can
race against the dedupe process?

In particular, consider the following hypothetical scenario:

Attacker has discovered a hash collision for some important data that they 
can read but not write (e.g. /etc/passwd, /home/user/company-critical-
data.ods). They copy the important data from its original location to 
somewhere they can write to on the same filesystem.

Now for the evil bit; they wait, watching for the dedupe process to run. 
When it''s had time to verify hash and memcmp the data, but before it
calls
the ioctl, Attacker swaps the copy of the data under their control for the 
bad one with the hash collision.

If I''ve understood the code correctly, if Attacker''s version
of the data is
the source from the perspective of the ioctl, the kernel will hash the data, 
determine that the hash matches, not cross-check the entire extent with 
memcmp or equivalent, and will then splice Attacker''s version of data
into
the original file. If the collision merely lets Attacker trash the original, 
that''s bad enough; if it lets them put other interesting content in
place,
it''s a really bad problem.

-- 
Here''s hoping I simply missed something,

Simon Farnsworth

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Diego Calleja

2011-Jan-05 18:41 UTC

head link

Re: Offline Deduplication for Btrfs

On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic
escribió:> So by doing the hash indexing offline, the total amount of disk I/O 
> required effectively doubles, and the amount of CPU spent on doing the 
> hashing is in no way reduced.
But there are people who might want to avoid temporally the extra cost
of online dedup, and do it offline when the server load is smaller.

In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably implement both.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ray Van Dolson

2011-Jan-05 19:01 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja
wrote:> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
> > So by doing the hash indexing offline, the total amount of disk I/O 
> > required effectively doubles, and the amount of CPU spent on doing the
> > hashing is in no way reduced.
> 
> But there are people who might want to avoid temporally the extra cost
> of online dedup, and do it offline when the server load is smaller.
> 
> In my opinion, both online and offline dedup have valid use cases, and
> the best choice is probably implement both.
Question from an end-user.  When we say "offline" deduplication, are
we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access ot the blocks to do its job?

Ray
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-05 19:46 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 05, 2011 at 05:42:42PM +0000, Gordan Bobic
wrote:> Josef Bacik wrote:
>
>> Basically I think online dedup is huge waste of time and completely
useless.
>
> I couldn''t disagree more. First, let''s consider what is
the
> general-purpose use-case of data deduplication. What are the resource  
> requirements to perform it? How do these resource requirements differ  
> between online and offline?
>
> The only sane way to keep track of hashes of existing blocks is using an  
> index. Searches through an index containing evenly distributed data  
> (such as hashes) is pretty fast (log(N)), and this has to be done  
> regardless of whether the dedupe is online or offline. It also goes  
> without saying that all the blocks being deduplicated need to be hashed,  
> and the cost of this is also the same whether the block is hashes online  
> or offline.
>
> Let''s look at the relative merits:
>
> 1a) Offline
> We have to copy the entire data set. This means we are using the full  
> amount of disk writes that the data set size dictates. Do we do the  
> hashing of current blocks at this point to create the indexes? Or do we  
> defer it until some later time?
>
> Doing it at the point of writes is cheaper - we already have the data in  
> RAM and we can calculate the hashes as we are writing each block.  
> Performance implications of this are fairly analogous to the parity RAID  
> RMW performance issue - to achieve decent performance you have to write  
> the parity at the same time as the rest of the stripe, otherwise you  
> have to read the part of the stripe you didn''t write, before
calculating
> the checksum.
>
> So by doing the hash indexing offline, the total amount of disk I/O  
> required effectively doubles, and the amount of CPU spent on doing the  
> hashing is in no way reduced.
>
> How is this in any way advantageous?
>
> 1b) Online
> As we are writing the data, we calculate the hashes for each block. (See  
> 1a for argument of why I believe this is saner and cheaper than doing it  
> offline.) Since we already have these hashes, we can do a look-up in the  
> hash-index, and either write out the block as is (if that hash
isn''t
> already in the index) or simply write the pointer to an existing  
> suitable block (if it already exists). This saves us writing out that  
> block - fewer writes to the disk, not to mention we don''t later
have to
> re-read the block to dedupe it.
>
> So in this case, instead of write-read-relink of the offline scenario,  
> we simply do relink on duplicate blocks.
>
> There is another reason to favour the online option due to it''s
lower
> write stress - SSDs. Why hammer the SSD with totally unnecessary writes?
>
> The _only_ reason to defer deduping is that hashing costs CPU time. But  
> the chances are that a modern CPU core can churn out MD5 and/or SHA256  
> hashes faster than a modern mechanical disk can keep up. A 15,000rpm  
> disk can theoretically handle 250 IOPS. A modern CPU can handle  
> considerably more than 250 block hashings per second. You could argue  
> that this changes in cases of sequential I/O on big files, but a 1.86GHz  
> GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will  
> struggle to keep up with.
>
> I don''t think that the realtime performance argument withstands
scrutiny.
>
>> You are going to want to do different things with different data.  For
example,
>> for a mailserver you are going to want to have very small blocksizes,
but for
>> say a virtualization image store you are going to want much larger
blocksizes.
>> And lets not get into heterogeneous environments, those just get much
too
>> complicated.
>
> In terms of deduplication, IMO it should really all be uniform,  
> transparent, and block based. In terms of specifying which subtrees to  
> dedupe, that should really be a per subdirectory hereditary attribute,  
> kind of like compression was supposed to work with chattr +c in the past.
>
>> So my solution is batched dedup, where a user just runs this
>> command and it dedups everything at this point.  This avoids the very
costly
>> overhead of having to hash and lookup for duplicate extents online and
lets us
>> be _much_ more flexible about what we want to deduplicate and how we
want to do
>> it.
>
> I don''t see that it adds any flexibility compared to the
hereditary
> deduping attribute. I also don''t see that it is any cheaper.
It''s
> actually more expensive, according to the reasoning above.
>
> As an aside, zfs and lessfs both do online deduping, presumably for a  
> good reason.
>
> Then again, for a lot of use-cases there are perhaps better ways to  
> achieve the targed goal than deduping on FS level, e.g. snapshotting or  
> something like fl-cow:
> http://www.xmailserver.org/flcow.html
>
> Personally, I would still like to see a fl-cow like solution that  
> actually preserves the inode numbers of duplicate files while providing  
> COW functionality that breaks this unity (and inode number identity)  
> upon writes, specifically because it saves page cache (only have to  
> cache one copy) and in case of DLLs on chroot style virtualization  
> (OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are  
> all mapped into the same memory, thus yielding massive memory savings on  
> machines with a lot of VMs.
>
Blah blah blah, I''m not having an argument about which is better
because I
simply do not care.  I think dedup is silly to begin with, and online dedup even
sillier.  The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I''d save 300
megabytes.  I''ll say that again, with 107 gigabytes in use,
I''d save 300
megabytes.  So in the normal user case dedup would have been wholey useless to
me.

Dedup is only usefull if you _know_ you are going to have duplicate information,
so the two major usecases that come to mind are

1) Mail server.  You have small files, probably less than 4k (blocksize) that
you are storing hundreds to thousands of.  Using dedup would be good for this
case, and you''d have to have a small dedup blocksize for it to be
usefull.

2) Virtualized guests.  If you have 5 different RHEL5 virt guests, chances are
you are going to share data between them, but unlike with the mail server
example, you are likely to find much larger chunks that are the same, so
you''d
want a larger dedup blocksize, say 64k.  You want this because if you did just
4k you''d end up with a ridiculous amount of framentation and
performance would
go down the toilet, so you need a larger dedup blocksize to make for better
performance.

So you''d want an online implementation to give you a choice of dedup
blocksize,
which seems to me to be overly complicated.

And then lets bring up the fact that you _have_ to manually compare any data you
are going to dedup.  I don''t care if you think you have the greatest
hashing
algorithm known to man, you are still going to have collisions somewhere at some
point, so in order to make sure you don''t lose data, you have to
manually memcmp
the data.  So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write.  That
is going to make your write performance _suck_.

Do I think offline dedup is awesome?  Hell no, but I got distracted doing it as
a side project so I figured I''d finish it, and I did it in under 1400
lines.  I
dare you to do the same with an online implementation.  Offline is simpler to
implement and simpler to debug if something goes wrong, and has an overall
easier to control impact on the system.

So there is an entirely too long of a response for something I didn''t
really
want to get into.  People are free to do an online implementation, and good luck
to them, but as for me I think it''s stupid and won''t be taking
it up anytime
soon.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lars Wirzenius

2011-Jan-05 19:58 UTC

head link

Re: Offline Deduplication for Btrfs

On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:> Blah blah blah, I''m not having an argument about which is better
because I
> simply do not care.  I think dedup is silly to begin with, and online dedup
even
> sillier.  The only reason I did offline dedup was because I was just toying
> around with a simple userspace app to see exactly how much I would save if
I did
> dedup on my normal system, and with 107 gigabytes in use, I''d save
300
> megabytes.  I''ll say that again, with 107 gigabytes in use,
I''d save 300
> megabytes.  So in the normal user case dedup would have been wholey useless
to
> me.
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That''s assuming no MD5 hash
collisions.) I don''t have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.

(For my script, see find-duplicate-chunks in
http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
http://braawi.org/obnam/ is the home page of the backup app.)

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Freddie Cash

2011-Jan-05 20:12 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik <josef@redhat.com>
wrote:> Dedup is only usefull if you _know_ you are going to have duplicate
information,
> so the two major usecases that come to mind are
>
> 1) Mail server.  You have small files, probably less than 4k (blocksize)
that
> you are storing hundreds to thousands of.  Using dedup would be good for
this
> case, and you''d have to have a small dedup blocksize for it to be
usefull.
>
> 2) Virtualized guests.  If you have 5 different RHEL5 virt guests, chances
are
> you are going to share data between them, but unlike with the mail server
> example, you are likely to find much larger chunks that are the same, so
you''d
> want a larger dedup blocksize, say 64k.  You want this because if you did
just
> 4k you''d end up with a ridiculous amount of framentation and
performance would
> go down the toilet, so you need a larger dedup blocksize to make for better
> performance.
You missed out on the most obvious, and useful, use case for dedupe:
  central backups server.

Our current backup server does an rsync backup of 127 servers every
night into a single ZFS pool.  90+ of those servers are identical
Debian installs (school servers), 20-odd of those are identical
FreeBSD installs (firewalls/routers), and the rest are mail/web/db
servers (Debian, Ubuntu, RedHat, Windows).

Just as a test, we copied a week of backups to a Linux box running
ZFS-fuse with dedupe enabled, and had a combined dedupe/compress
ration in the low double-digits (11 or 12x, something like that).  Now
we''re just waiting for ZFSv22+ to hit FreeBSD to enable dedupe on the
backups server.

For backups, and central storage for VMs, online dedupe is a massive
win.  Offline, maybe.  Either way, dedupe is worthwhile.

-- 
Freddie Cash
fjwcash@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-05 20:15 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 05, 2011 at 07:58:13PM +0000, Lars Wirzenius
wrote:> On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> > Blah blah blah, I''m not having an argument about which is
better because I
> > simply do not care.  I think dedup is silly to begin with, and online
dedup even
> > sillier.  The only reason I did offline dedup was because I was just
toying
> > around with a simple userspace app to see exactly how much I would
save if I did
> > dedup on my normal system, and with 107 gigabytes in use, I''d
save 300
> > megabytes.  I''ll say that again, with 107 gigabytes in use,
I''d save 300
> > megabytes.  So in the normal user case dedup would have been wholey
useless to
> > me.
> 
> I have been thinking a lot about de-duplication for a backup application
> I am writing. I wrote a little script to figure out how much it would
> save me. For my laptop home directory, about 100 GiB of data, it was a
> couple of percent, depending a bit on the size of the chunks. With 4 KiB
> chunks, I would save about two gigabytes. (That''s assuming no MD5
hash
> collisions.) I don''t have VM images, but I do have a fair bit of
saved
> e-mail. So, for backups, I concluded it was worth it to provide an
> option to do this. I have no opinion on whether it is worthwhile to do
> in btrfs.
> 
Yeah for things where you are talking about sending it over the network or
something like that every little bit helps.  I think deduplication is far more
interesting and usefull at an application level than at a filesystem level.  For
example with a mail server, there is a good chance that the files will be
smaller than a blocksize and not be able to be deduped, but if the application
that was storing them recognized that it had the same messages and just linked
everything in its own stuff then that would be cool.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-05 20:25 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/05/2011 06:41 PM, Diego Calleja wrote:> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
>> So by doing the hash indexing offline, the total amount of disk I/O
>> required effectively doubles, and the amount of CPU spent on doing the
>> hashing is in no way reduced.
>
> But there are people who might want to avoid temporally the extra cost
> of online dedup, and do it offline when the server load is smaller.
The point is that the offline dedup is actually twice as expensive, and 
the hashing part is nowhere nearly expensive as disk I/O. Disk I/O is 
very limited today, compared to CPU time.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-05 20:27 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/05/2011 07:01 PM, Ray Van Dolson wrote:> On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
>> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
>>> So by doing the hash indexing offline, the total amount of disk I/O
>>> required effectively doubles, and the amount of CPU spent on doing
the
>>> hashing is in no way reduced.
>>
>> But there are people who might want to avoid temporally the extra cost
>> of online dedup, and do it offline when the server load is smaller.
>>
>> In my opinion, both online and offline dedup have valid use cases, and
>> the best choice is probably implement both.
>
> Question from an end-user.  When we say "offline" deduplication,
are we
> talking about post-process deduplication (a la what Data ONTAP does
> with their SIS implementation) during which the underlying file system
> data continues to be available, or a process that needs exclusive
> access ot the blocks to do its job?
I was assuming it was a regular cron-job that grinds away on the disks 
but doesn''t require downtime.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-05 20:28 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson
wrote:> On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
> > On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
> > > So by doing the hash indexing offline, the total amount of disk
I/O
> > > required effectively doubles, and the amount of CPU spent on
doing the
> > > hashing is in no way reduced.
> > 
> > But there are people who might want to avoid temporally the extra cost
> > of online dedup, and do it offline when the server load is smaller.
> > 
> > In my opinion, both online and offline dedup have valid use cases, and
> > the best choice is probably implement both.
> 
> Question from an end-user.  When we say "offline" deduplication,
are we
> talking about post-process deduplication (a la what Data ONTAP does
> with their SIS implementation) during which the underlying file system
> data continues to be available, or a process that needs exclusive
> access ot the blocks to do its job?
> 
Yeah its just a post-process thing, you run it when you care to run it and it
doesn''t make anything unavailable.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Freddie Cash

2011-Jan-05 20:34 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik <josef@redhat.com>
wrote:> Yeah for things where you are talking about sending it over the network or
> something like that every little bit helps.  I think deduplication is far
more
> interesting and usefull at an application level than at a filesystem level.
 For
> example with a mail server, there is a good chance that the files will be
> smaller than a blocksize and not be able to be deduped, but if the
application
> that was storing them recognized that it had the same messages and just
linked
> everything in its own stuff then that would be cool.  Thanks,
Cyrus IMAP and Zimbra (probably a lot of others) already do that,
hard-linking identical message bodies.  The e-mail server use-case is
for dedupe is pretty much covered already.


-- 
Freddie Cash
fjwcash@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-05 20:46 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/05/2011 07:46 PM, Josef Bacik wrote:
> Blah blah blah, I''m not having an argument about which is better
because I
> simply do not care.  I think dedup is silly to begin with, and online dedup
even
> sillier.
Offline dedup is more expensive - so why are you of the opinion that it 
is less silly? And comparison by silliness quotiend still sounds like an 
argument over which is better.
> The only reason I did offline dedup was because I was just toying
> around with a simple userspace app to see exactly how much I would save if
I did
> dedup on my normal system, and with 107 gigabytes in use, I''d save
300
> megabytes.  I''ll say that again, with 107 gigabytes in use,
I''d save 300
> megabytes.  So in the normal user case dedup would have been wholey useless
to
> me.
Dedup isn''t for an average desktop user. Dedup is for backup storage
and
virtual images. I don''t remember anyone ever saying it is for the 
average desktop user. I am amazed you got that much saving even - I 
wouldn''t expect there to be any duplicate files on a normal system. 
Compression is a feature that the desktop users would benefit with, not 
deduplication.
> Dedup is only usefull if you _know_ you are going to have duplicate
information,
> so the two major usecases that come to mind are
>
> 1) Mail server.  You have small files, probably less than 4k (blocksize)
that
> you are storing hundreds to thousands of.  Using dedup would be good for
this
> case, and you''d have to have a small dedup blocksize for it to be
usefull.
Explain to me why you think this would yield duplicate blocks. If your 
server is Maildir, headers will be in the mail files, and because all 
emails went to different users, they''d have different headers, and thus
not be dedupable.
> 2) Virtualized guests.  If you have 5 different RHEL5 virt guests, chances
are
> you are going to share data between them, but unlike with the mail server
> example, you are likely to find much larger chunks that are the same, so
you''d
> want a larger dedup blocksize, say 64k.  You want this because if you did
just
> 4k you''d end up with a ridiculous amount of framentation and
performance would
> go down the toilet, so you need a larger dedup blocksize to make for better
> performance.
Fragmentation will cause you problems anyway, the argument in the UNIX 
world since year dot was that defragging doesn''t make a damn worth of 
difference when you have a hundred users hammering away on a machine 
that has to skip between all their collective files.

If you have VM image files a-la vmware/xen/kvm, then using blocks of the 
same size as the guests is the only way that you are going to get sane 
deduplication performance. Otherwise the blocks won''t line up. If the 
dedupe block size is 4KB and guest fs block size is 4KB, that''s a 
reasonably clean case.

The biggest win by far, however, would be when using chroot type guests, 
as I mentioned.
> So you''d want an online implementation to give you a choice of
dedup blocksize,
> which seems to me to be overly complicated.
I''d just make it always use the fs block size. No point in making it 
variable.
> And then lets bring up the fact that you _have_ to manually compare any
data you
> are going to dedup.  I don''t care if you think you have the
greatest hashing
> algorithm known to man, you are still going to have collisions somewhere at
some
> point, so in order to make sure you don''t lose data, you have to
manually memcmp
> the data.  So if you are doing this online, that means reading back the
copy you
> want to dedup in the write path so you can do the memcmp before you write. 
That
> is going to make your write performance _suck_.
IIRC, this is configurable in ZFS so that you can switch off the 
physical block comparison. If you use SHA256, the probability of a 
collission (unless SHA is broken, in which case we have much bigger 
problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 
Exabytes. That''s one trillion trillion (that''s double
trillion)
Exabytes. That is considerably more storage space than there is likely 
to be available on the planet for some time. And just for good measure, 
you could always up the hash to SHA512 or use two different hashes (e.g. 
a combination of SHA256 and MD5).
> Do I think offline dedup is awesome?  Hell no, but I got distracted doing
it as
> a side project so I figured I''d finish it, and I did it in under
1400 lines.  I
> dare you to do the same with an online implementation.  Offline is simpler
to
> implement and simpler to debug if something goes wrong, and has an overall
> easier to control impact on the system.
It is also better done outside the FS if you''re not going to do it 
properly using FL-COW or fuse based lessfs.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lars Wirzenius

2011-Jan-05 21:07 UTC

head link

Re: Offline Deduplication for Btrfs

On ke, 2011-01-05 at 19:58 +0000, Lars Wirzenius wrote:> (For my script, see find-duplicate-chunks in
> http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
> current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
> http://braawi.org/obnam/ is the home page of the backup app.)
If I may add: it would perhaps be good to collect numbers on the amount
of duplication (for various block sizes) there is on different kinds of
systems: random laptops, file servers for small companies and large
companies, mail servers, backup servers, VM servers, etc. Would anyone
be interested in collecting such numbers? A script like mine would be a
bit heavy to run, but not too much so, I bet.

It would be good to have hard numbers as a basis of discussion rather
than guesses and assumptions.

Or perhaps someone''s already collected the data?

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Diego Calleja

2011-Jan-05 21:14 UTC

head link

Re: Offline Deduplication for Btrfs

On Miércoles, 5 de Enero de 2011 21:25:41 Gordan Bobic
escribió:> The point is that the offline dedup is actually twice as expensive, and 
> the hashing part is nowhere nearly expensive as disk I/O. Disk I/O is 
> very limited today, compared to CPU time.
And my point is:
> But there are people who might want to avoid temporally the extra cost
> of online dedup, and do it offline when the server load is smaller.
In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication, and the useless resource waste
offsets the other advantages. With offline dedup, you only deduplicate
when the system really needs it.

And I can also imagine some unrealistic but theorically valid cases,
like for example an embedded device that for some weird reason needs
deduplication but doesn''t want online dedup because it needs to save
as much power as possible. But it can run an offline dedup when the
batteries are charging.

It''s clear to me that if you really want a perfect deduplication
solution you need both systems.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-05 21:21 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/05/2011 09:14 PM, Diego Calleja wrote:> In fact, there are cases where online dedup is clearly much worse. For
> example, cases where people suffer duplication, but it takes a lot of
> time (several months) to hit it. With online dedup, you need to enable
> it all the time to get deduplication, and the useless resource waste
> offsets the other advantages. With offline dedup, you only deduplicate
> when the system really needs it.
My point is that on a file server you don''t need to worry about the CPU
cost of deduplication because you''ll run out of I/O long before you run
out of CPU.
> And I can also imagine some unrealistic but theorically valid cases,
> like for example an embedded device that for some weird reason needs
> deduplication but doesn''t want online dedup because it needs to
save
> as much power as possible. But it can run an offline dedup when the
> batteries are charging.
That''s _very_ theoretical.
> It''s clear to me that if you really want a perfect deduplication
> solution you need both systems.
I''m not against having both. :)

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 01:03 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/06/2011 12:22 AM, Spelic wrote:> On 01/05/2011 09:46 PM, Gordan Bobic wrote:
>> On 01/05/2011 07:46 PM, Josef Bacik wrote:
>>
>> Offline dedup is more expensive - so why are you of the opinion that
>> it is less silly? And comparison by silliness quotiend still sounds
>> like an argument over which is better.
>>
>
> If I can say my opinion, I wouldn''t want dedup to be enabled
online for
> the whole filesystem.
>
> Three reasons:
>
> 1- Virtual machine disk images should not get deduplicated imho, if you
> care about performances, because fragmentation is more important in that
> case.
I disagree. You''ll gain much, much more from improved caching and 
reduced page cache usage than you''ll lose from fragmentation.
> So offline dedup is preferable IMHO. Or at least online dedup should
> happen only on configured paths.
Definitely agree that it should be a per-directory option, rather than 
per mount.
> 2- I don''t want performances to drop all the time. I would run
dedup
> periodically on less active hours, hence, offline. A rate limiter should
> also be implemented so not to trash the drives too much. Also a stop and
> continue should be implemented, so that dedup which couldn''t
finish
> within a certain time-frame (e.g. one night) can be made continue the
> night after without restarting from the beginning.
This is the point I was making - you end up paying double the cost in 
disk I/O and the same cost in CPU terms if you do it offline. And I am 
not convniced the overhead of calculating checksums is that great. There 
are already similar overheads in checksums being calculated to enable 
smart data recovery in case of silent disk corruption.

Now that I mentioned, that, it''s an interesting point. Could these be 
unified? If we crank up the checksums on files a bit, to something 
suitably useful for deduping, it could make the deduping feature almost 
free.

As for restarting deduping (e.g. you chattr -R a directory to specify it 
for deduping. Since the contents aren''t already deduped (the
files''
entries aren''t in the hash index, it''d be obvious what still
needs to be
deduped and what doesn''t.
> 3- Only some directories should be deduped, for performance reasons. You
> can foresee where duplicate blocks can exist and where not. Backup
> directories typically, or mailservers directories. The rest is probably
> a waste of time.
Indeed, see above. I think it should be a per file setting/attribute, 
hereditary from the parent directory.
> Also, the OS is small even if identical on multiple virtual images, how
> much is going to occupy anyway? Less than 5GB per disk image usually.
> And that''s the only thing that would be deduped because data
likely to
> be different on each instance. How many VMs running you have? 20?
That''s
> at most 100GB saved one-time at the cost of a lot of fragmentation.
That''s also 100GB fewer disk blocks in contention for page cache. If 
you''re hitting the disks, you''re already going to slow down by
several
orders of magnitude. Better to make the caching more effective.
>>> So if you are doing this online, that means reading back the copy
you
>>> want to dedup in the write path so you can do the memcmp before you
>>> write. That
>>> is going to make your write performance _suck_.
>>
>> IIRC, this is configurable in ZFS so that you can switch off the
>> physical block comparison. If you use SHA256, the probability of a
>> collission (unless SHA is broken, in which case we have much bigger
>> problems) is 1^128. Times 4KB blocks, that is one collission in 10^24
>> Exabytes. That''s one trillion trillion (that''s double
trillion) Exabytes.
>
> I like mathematics, but I don''t care this time. I would never
enable
> dedup without full blocks compare. I think most users and most companies
> would do the same.
I understand where you are coming from, but by that reasoning you could 
also argue that AES256 isn''t good enough to keep your data
confidential.
It is a virtual certainty that you will lose several times that much 
data through catastrophic disk+raid+backup failures than through finding 
a hash collission.
> If there is full blocks compare, a simpler/faster algorithm could be
> chosen, like md5. Or even a md-64bits which I don''t think it
exists, but
> you can take MD4 and then xor the first 8 bytes with the second 8 bytes
> so to reduce it to 8 bytes only. This is just because it saves 60% of
> the RAM occupation during dedup, which is expected to be large, and the
> collisions are still insignificant at 64bits. Clearly you need to do
> full blocks compare after that.
I really don''t think the cost in terms of a few bytes per file for the 
hashes is that significant.
> Note that deduplication IS a cryptographically sensitive matter because
> if sha-1 is cracked, people can nuke (or maybe even alter, and with
> this, hack privileges) other users'' files by providing blocks with
the
> same SHA and waiting for dedup to pass.
> Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
> SHA1 and AES are two wrong standards...
That''s just alarmist. AES is being cryptanalyzed because everything
uses
it. And the news of it''s insecurity are somewhat exaggerated (for now
at
least).
> Dedup without full blocks compare seems indeed suited for online dedup
> (which I wouldn''t enable, now for one more reason) because with
full
> block compares performances would really suck. But please leave full
> blocks compare for the offline dedup.
Actually, even if you are doing full block compares, online would still 
be faster, because at least one copy will already be in page cache, 
ready to hash. Online you get checksum+read+write, offline you get 
read+checksum+read+write. You still end up 1/3 ahead in terms if IOPS 
required.
> Also I could suggest a third type of deduplication, but this is
> harder... it''s a file-level deduplication which works like xdelta,
that
> is, it is capable to recognize piece of identical data on two files,
> which are not at the same offset and which are not even aligned at block
> boundary. For this, a rolling hash like the one of rsync, or the xdelta
> 3.0 algorithm could be used. For this to work I suppose Btrfs needs to
> handle the padding of filesystem blocks... which I''m not sure it
was
> foreseen.
I think you''ll find this is way too hard to do sensibly. You are almost
doing a rzip pass over the whole file system. I don''t think
it''s really
workable.
> Above in this thread you said:
>> The _only_ reason to defer deduping is that hashing costs CPU time.
>> But the chances are that a modern CPU core can churn out MD5 and/or
>> SHA256 hashes faster than a modern mechanical disk can keep up. A
>> 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can
>> handle considerably more than 250 block hashings per second. You could
>> argue that this changes in cases of sequential I/O on big files, but a
>> 1.86GHz GHz Core2 can churn through 111MB/s of SHA256, which even SSDs
>> will struggle to keep up with.
>
> A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
> A SSD can do more like 200MB/sec write 280MB/sec read sequential or
> random and is actually limited only by the SATA 3.0gbit/sec but soon
> enough they will have SATA/SAS 6.0gbit/sec.
But if you are spewing that much sequential data all the time, your 
workload is highly unusual, not to mention that those SSDs won''t last a
year. And if you are streaming live video or have a real-time data 
logging application that generates that much data, the chances are that 
yuo won''t have gained anything from deduping anyway. I don''t
think it''s
a valid use case, at least until you can come up with at least a 
remotely realistic scenario where you might get plausible benefit from 
deduping in terms of space savings that involves sequentially streaming 
data to disk at full speed.
> More cores can be used for hashing but multicore implementation for
> stuff that is not natively threaded (such as parallel and completely
> separate queries to a DB) usually is very difficult to do well. E.g. it
> was attempted recently on MD raid for parity computation by
> knowledgeable people but it performed so much worse than single-core
> that it was disabled.
You''d need a very fat array for one core to be unable to keep up. 
According to my dmesg, RAID5 checksumming on the box on my desk tops out 
at 1.8GB/s, and RAID6 at 1.2GB/s. That''s a lot of disks'' worth
of
bandwidth to have in an array, and that''s assuming large, streaming 
writes that can be handled efficiently. In reality, on smaller writes 
you''ll find you are severely bottlenecked by disk seek times, even if 
you have carefully tuned your MD and file system parameters to perfection.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Spelic

2011-Jan-06 01:19 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/05/2011 09:46 PM, Gordan Bobic wrote:> On 01/05/2011 07:46 PM, Josef Bacik wrote:
>
> Offline dedup is more expensive - so why are you of the opinion that 
> it is less silly? And comparison by silliness quotiend still sounds 
> like an argument over which is better.
>
If I can say my opinion, I wouldn''t want dedup to be enabled online for
the whole filesystem.

Three reasons:

1- Virtual machine disk images should not get deduplicated imho, if you 
care about performances, because fragmentation is more important in that 
case.
So offline dedup is preferable IMHO. Or at least online dedup should 
happen only on configured paths.

2- I don''t want performances to drop all the time. I would run dedup 
periodically on less active hours, hence, offline. A rate limiter should 
also be implemented so not to trash the drives too much. Also a stop and 
continue should be implemented, so that dedup which couldn''t finish 
within a certain time-frame (e.g. one night) can be made continue the 
night after without restarting from the beginning.

3- Only some directories should be deduped, for performance reasons. You 
can foresee where duplicate blocks can exist and where not. Backup 
directories typically, or mailservers directories. The rest is probably 
a waste of time.
> Dedup isn''t for an average desktop user. Dedup is for backup
storage
> and virtual images
Not virtual images imho, for the reason above.

Also, the OS is small even if identical on multiple virtual images, how 
much is going to occupy anyway? Less than 5GB per disk image usually. 
And that''s the only thing that would be deduped because data likely to 
be different on each instance. How many VMs running you have? 20?
That''s
at most 100GB saved one-time at the cost of a lot of fragmentation.

Now if you backup those images periodically, that''s a place where I 
would run dedup.
> I''d just make it always use the fs block size. No point in making
it
> variable.
Agreed. What is the reason for variable block size?
>> And then lets bring up the fact that you _have_ to manually compare 
>> any data you
>> are going to dedup.  I don''t care if you think you have the
greatest
>> hashing
>> algorithm known to man, you are still going to have collisions 
>> somewhere at some
>> point, so in order to make sure you don''t lose data, you have
to
>> manually memcmp
>> the data.
Totally agreed
>> So if you are doing this online, that means reading back the copy you
>> want to dedup in the write path so you can do the memcmp before you 
>> write.  That
>> is going to make your write performance _suck_.
>
> IIRC, this is configurable in ZFS so that you can switch off the 
> physical block comparison. If you use SHA256, the probability of a 
> collission (unless SHA is broken, in which case we have much bigger 
> problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 
> Exabytes. That''s one trillion trillion (that''s double
trillion) Exabytes.
I like mathematics, but I don''t care this time. I would never enable 
dedup without full blocks compare. I think most users and most companies 
would do the same.

If there is full blocks compare, a simpler/faster algorithm could be 
chosen, like md5. Or even a md-64bits which I don''t think it exists,
but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes 
so to reduce it to 8 bytes only. This is just because it saves 60% of 
the RAM occupation during dedup, which is expected to be large, and the 
collisions are still insignificant at 64bits. Clearly you need to do 
full blocks compare after that.

BTW if you want to allow (as an option) dedup without full blocks 
compare, SHA1 is not so good: sha-0 already had problems, now sha-1 has 
problems, I almost wouldn''t suggest it for cryptographically secure 
stuff foreseeing the future. Use ripemd160 or even better ripemd256 
which is even faster according to 
http://www.cryptopp.com/benchmarks.html  ripemds are much better 
algorithms than shas: they have no known weaknesses.
Note that deduplication IS a cryptographically sensitive matter because 
if sha-1 is cracked, people can nuke (or maybe even alter, and with 
this, hack privileges) other users'' files by providing blocks with the 
same SHA and waiting for dedup to pass.
Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
SHA1 and AES are two wrong standards...

Dedup without full blocks compare seems indeed suited for online dedup 
(which I wouldn''t enable, now for one more reason) because with full 
block compares performances would really suck. But please leave full 
blocks compare for the offline dedup.

Also I could suggest a third type of deduplication, but this is 
harder... it''s a file-level deduplication which works like xdelta, that
is, it is capable to recognize piece of identical data on two files, 
which are not at the same offset and which are not even aligned at block 
boundary. For this, a rolling hash like the one of rsync, or the xdelta 
3.0 algorithm could be used. For this to work I suppose Btrfs needs to 
handle the padding of filesystem blocks... which I''m not sure it was 
foreseen.

Above in this thread you said:> The _only_ reason to defer deduping is that hashing costs CPU time. 
> But the chances are that a modern CPU core can churn out MD5 and/or 
> SHA256 hashes faster than a modern mechanical disk can keep up. A 
> 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can 
> handle considerably more than 250 block hashings per second. You could 
> argue that this changes in cases of sequential I/O on big files, but a 
> 1.86GHz GHz Core2 can churn through 111MB/s of SHA256, which even SSDs 
> will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or 
random and is actually limited only by the SATA 3.0gbit/sec but soon 
enough they will have SATA/SAS 6.0gbit/sec.
More cores can be used for hashing but multicore implementation for 
stuff that is not natively threaded (such as parallel and completely 
separate queries to a DB) usually is very difficult to do well. E.g. it 
was attempted recently on MD raid for parity computation by 
knowledgeable people but it performed so much worse than single-core 
that it was disabled.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2011-Jan-06 01:29 UTC

head link

Re: Offline Deduplication for Btrfs

Excerpts from Gordan Bobic''s message of 2011-01-05 12:42:42
-0500:> Josef Bacik wrote:
> 
> > Basically I think online dedup is huge waste of time and completely
useless.
> 
> I couldn''t disagree more. First, let''s consider what is
the
> general-purpose use-case of data deduplication. What are the resource 
> requirements to perform it? How do these resource requirements differ 
> between online and offline?
I don''t really agree with Josef that dedup is dumb, but I do think his
current approach is the most reasonable.  Dedup has a few very valid use
cases, which I think break down to:

1) backups
2) VM images.

The backup farm use case is the best candidate for dedup in general
because they are generally write once and hopefully read never.
Fragmentation for reading doesn''t matter at all and we''re
really very
sure we''re going to backup the same files over and over again.

But, it''s also something that will be dramatically more efficient when
the backup server helps out.  The backup server knows two files have the
same name, same size and can guess with very high accuracy that they
will be the same.  So it is a very good candidate for Josef''s offline
dedup because it can just do the dedup right after writing the file.

In the backup farm, whole files are very likely to be identical, which
again is very easy to optimize with Josef''s approach.

Next is the VM images.  This is actually a much better workload for
online dedup, except for the part where our poor storage server would be
spending massive amounts of CPU deduping blocks for all the VMs on the
machine.  In this case the storage server doesn''t know the
filenames, it just has bunches of blocks that are likely to be the same
across VMs.

So, it seems a bit silly to do this out of band, where we wander through
the FS and read a bunch of blocks in hopes of finding ones with the same
hash.

But, one of the things on our features-to-implement page is to wander
through the FS and read all the blocks from time to time.  We want to do
this in the background to make sure the bits haven''t rotted on disk. 
By
scrubbing from time to time we are able to make sure that when a disk
does die, other disks in the FS are likely to have a good copy.

So again, Josef''s approach actually works very well.  His dedup util
could be the scrubbing util and we''ll get two projects for the price of
one.

As for the security of hashes, we''re unlikely to find a collision on a
sha256 that wasn''t made maliciously.  If the system''s data is
controlled and you''re not worried about evil people putting files on
there, extra reads really aren''t required.

But then again, extra reads are a good thing (see above about
scrubbing).  The complexity of the whole operation goes down
dramatically when we do the verifications because hash index corruptions
(this extent has this hash) will be found instead of blindly trusted.

None of this means that online dedup is out of the question, I just
think the offline stuff is a great way to start.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Spelic

2011-Jan-06 01:56 UTC

head link

Re: Offline Deduplication for Btrfs

On 01/06/2011 02:03 AM, Gordan Bobic wrote:>
> That''s just alarmist. AES is being cryptanalyzed because
everything
> uses it. And the news of it''s insecurity are somewhat exaggerated
(for
> now at least). 
Who cares... the fact of not being much used is a benefit for RIPEMD / 
blowfish-twofish then.
Nobody makes viruses for Linux because they target windows. Same thing...
RIPEMD has still an advantage over SHA imho, and blowfish over AES.

>
>> If there is full blocks compare, a simpler/faster algorithm could be
>> chosen, like md5. Or even a md-64bits which I don''t think it
exists, but
>> you can take MD4 and then xor the first 8 bytes with the second 8 bytes
>> so to reduce it to 8 bytes only. This is just because it saves 60% of
>> the RAM occupation during dedup, which is expected to be large, and the
>> collisions are still insignificant at 64bits. Clearly you need to do
>> full blocks compare after that.
>
> I really don''t think the cost in terms of a few bytes per file for
the
> hashes is that significant.
20 to 8 = 12 bytes per *filesystem block* saved, I think
Aren''t we talking about block-level deduplication?
For every TB of filesystem you occupy 2GB of RAM with hashes instead of 
5.3GB (I am assuming 4K blocks, I don''t remember how big are btrfs
blocks)
For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might 
be the edge between feasible and not feasible.
Actually it might not be feasible anyway... an option to store hashes 
into a ssd should be provided then...
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Freddie Cash

2011-Jan-06 03:33 UTC

head link

Re: Offline Deduplication for Btrfs

On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic <gordan@bobich.net>
wrote:> On 01/06/2011 12:22 AM, Spelic wrote:
> Definitely agree that it should be a per-directory option, rather than per
> mount.
JOOC, would the dedupe "table" be done per directory, per mount, per
sub-volume, or per volume?  The larger the pool of data to check
against, the better your dedupe ratios will be.

I''m not up-to-date on all the terminology that btrfs uses, and how it
compares to ZFS (disks -> vdevs -> pool -> filesystem/volume), so the
terms above may be incorrect.  :)

In the ZFS world, dedupe is done pool-wide in that any block in the
pool is a candidate for dedupe, but the dedupe property can be
enabled/disabled on a per-filesystem basis.  Thus, only blocks in
filesystems with the dedupe property enabled will be deduped.  But
blocks from any filesystem can be compared against.
> This is the point I was making - you end up paying double the cost in disk
> I/O and the same cost in CPU terms if you do it offline. And I am not
> convniced the overhead of calculating checksums is that great. There are
> already similar overheads in checksums being calculated to enable smart
data
> recovery in case of silent disk corruption.
>
> Now that I mentioned, that, it''s an interesting point. Could these
be
> unified? If we crank up the checksums on files a bit, to something suitably
> useful for deduping, it could make the deduping feature almost free.
This is what ZFS does.  Every block in the pool has a checksum
attached to it.  Originally, the default algorithm was fletcher2, with
fletcher4 and sha256 as alternates.  When dedupe was enabled, the
default was changed to fletcher4.  Dedupe also came with the option to
enable/disable a byte-for-byte verify when the hashes match.

By switching the checksum algorithm for the pool to sha256 ahead of
time, you can enable dedupe, and get the dedupe checksumming for free.
 :)
>> Also, the OS is small even if identical on multiple virtual images, how
>> much is going to occupy anyway? Less than 5GB per disk image usually.
>> And that''s the only thing that would be deduped because data
likely to
>> be different on each instance. How many VMs running you have? 20?
That''s
>> at most 100GB saved one-time at the cost of a lot of fragmentation.
>
> That''s also 100GB fewer disk blocks in contention for page cache.
If you''re
> hitting the disks, you''re already going to slow down by several
orders of
> magnitude. Better to make the caching more effective.
If you setup your VMs as diskless images, using NFS off a storage
server running <whatever FS> using dedupe, you can get a lot more out
of it than using disk image files (where you have all the block sizes
and alignment to worry about).  And the you can use all the fancy
snapshotting, cloning, etc features of <whatever FS> as well.

-- 
Freddie Cash
fjwcash@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter A

2011-Jan-06 03:58 UTC

head link

Re: Offline Deduplication for Btrfs

On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:> > I''d just make it always use the fs block size. No point in
making it
> > variable.
>
> Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this is one topic I (unfortunately) have experience with...

You wouldn''t believe the difference variable block size dedupe makes.
For a
pure fileserver, its ok to dedupe on block level but for most other uses,
variable is king. One big example is backups. Netbackup and most others
produce one stream with all data even when backing up to disk. Imagine you
move a whole lot of data from one dir to another. Think a directory with huge
video files. As a filesystem it would be de-duped nicely. The backup stream
however may and may not have matching fs blocks. If the directory name before
and after has the same lengths and such - then yeah, dedupe works. Directory
name is a byte shorter? Everything in the stream will be offset by one byte -
and no dedupe will occur at all on the whole dataset. In real world just
compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block
based) to a DataDomain (variable lenght) in this usage scenario. Among our
customers we see something like 3 to 17x dedupe ration on the DD, 1.02 - 1.05
in the 7000.

There are many other examples and in the end it comes down to if you want
general purpose de-dupe (e.g. something useful when you serve iscsi luns,
serve as backup target, ...) or if you only care about a pure file store,
you''re probably going to be ok with fixed block lengths...

Hope that helps,

Peter.

--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Yan, Zheng

2011-Jan-06 08:25 UTC

head link

Re: Offline Deduplication for Btrfs

On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik <josef@redhat.com>
wrote:> Here are patches to do offline deduplication for Btrfs.  It works well for
the
> cases it''s expected to, I''m looking for feedback on the
ioctl interface and
> such, I''m well aware there are missing features for the userspace
app (like
> being able to set a different blocksize).  If this interface is acceptable
I
> will flesh out the userspace app a little more, but I believe the kernel
side is
> ready to go.
>
> Basically I think online dedup is huge waste of time and completely
useless.
> You are going to want to do different things with different data.  For
example,
> for a mailserver you are going to want to have very small blocksizes, but
for
> say a virtualization image store you are going to want much larger
blocksizes.
> And lets not get into heterogeneous environments, those just get much too
> complicated.  So my solution is batched dedup, where a user just runs this
> command and it dedups everything at this point.  This avoids the very
costly
> overhead of having to hash and lookup for duplicate extents online and lets
us
> be _much_ more flexible about what we want to deduplicate and how we want
to do
> it.
>
> For the userspace app it only does 64k blocks, or whatever the largest area
it
> can read out of a file.  I''m going to extend this to do the
following things in
> the near future
>
> 1) Take the blocksize as an argument so we can have bigger/smaller blocks
> 2) Have an option to _only_ honor the blocksize, don''t try and
dedup smaller
> blocks
> 3) Use fiemap to try and dedup extents as a whole and just ignore specific
> blocksizes
> 4) Use fiemap to determine what would be the most optimal blocksize for the
data
> you want to dedup.
>
> I''ve tested this out on my setup and it seems to work well.  I
appreciate any
> feedback you may have.  Thanks,
>
FYI: Using clone ioctl can do the same thing (except reading data and
computing hash in user space).

Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tomasz Chmielewski

2011-Jan-06 09:37 UTC

head link

Re: Offline Deduplication for Btrfs

> I have been thinking a lot about de-duplication for a backup application
> I am writing. I wrote a little script to figure out how much it would
> save me. For my laptop home directory, about 100 GiB of data, it was a
> couple of percent, depending a bit on the size of the chunks. With 4 KiB
> chunks, I would save about two gigabytes. (That''s assuming no MD5
hash
> collisions.) I don''t have VM images, but I do have a fair bit of
saved
> e-mail. So, for backups, I concluded it was worth it to provide an
> option to do this. I have no opinion on whether it is worthwhile to do
> in btrfs.
Online deduplication is very useful for backups of big, multi-gigabyte 
files which change constantly.
Some mail servers store files this way; some MUA store the files like 
this; databases are also common to pack everything in big files which 
tend to change here and there almost all the time.

Multi-gigabyte files which only have few megabytes changed can''t be 
hardlinked; simple maths shows that even compressing multiple files 
which have few differences will lead to greater space usage than a few 
megabytes extra in each (because everything else is deduplicated).

And I don''t even want to think about IO needed to offline dedup a 
multi-terabyte storage (1 TB disks and bigger are becoming standard 
nowadays) i.e. daily, especially when the storage is already heavily 
used in IO terms.


Now, one popular tool which can deal with small changes in files is 
rsync. It can be used to copy files over the network - so that if you 
want to copy/update a multi-gigabyte file which only has a few changes, 
rsync would need to transfer just a few megabytes.

On disk however, rsync creates a "temporary copy" of the original
file,
where it packs unchanged contents together with any changes made. For 
example, while it copies/updates a file, we will have:

original_file.bin
.temporary_random_name

Later, original_file.bin would be removed, and .temporary_random_name 
would be renamed to original_file.bin. Here goes away any deduplication 
we had so far, we have to start the IO over again.


-- 
Tomasz Chmielewski
http://wpkg.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Hommey

2011-Jan-06 09:51 UTC

head link

Re: Offline Deduplication for Btrfs

On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski
wrote:> >I have been thinking a lot about de-duplication for a backup
application
> >I am writing. I wrote a little script to figure out how much it would
> >save me. For my laptop home directory, about 100 GiB of data, it was a
> >couple of percent, depending a bit on the size of the chunks. With 4
KiB
> >chunks, I would save about two gigabytes. (That''s assuming no
MD5 hash
> >collisions.) I don''t have VM images, but I do have a fair bit
of saved
> >e-mail. So, for backups, I concluded it was worth it to provide an
> >option to do this. I have no opinion on whether it is worthwhile to do
> >in btrfs.
> 
> Online deduplication is very useful for backups of big,
> multi-gigabyte files which change constantly.
> Some mail servers store files this way; some MUA store the files
> like this; databases are also common to pack everything in big files
> which tend to change here and there almost all the time.
> 
> Multi-gigabyte files which only have few megabytes changed can''t
be
> hardlinked; simple maths shows that even compressing multiple files
> which have few differences will lead to greater space usage than a
> few megabytes extra in each (because everything else is
> deduplicated).
> 
> And I don''t even want to think about IO needed to offline dedup a
> multi-terabyte storage (1 TB disks and bigger are becoming standard
> nowadays) i.e. daily, especially when the storage is already heavily
> used in IO terms.
> 
> 
> Now, one popular tool which can deal with small changes in files is
> rsync. It can be used to copy files over the network - so that if
> you want to copy/update a multi-gigabyte file which only has a few
> changes, rsync would need to transfer just a few megabytes.
> 
> On disk however, rsync creates a "temporary copy" of the original
> file, where it packs unchanged contents together with any changes
> made. For example, while it copies/updates a file, we will have:
> 
> original_file.bin
> .temporary_random_name
> 
> Later, original_file.bin would be removed, and
> .temporary_random_name would be renamed to original_file.bin. Here
> goes away any deduplication we had so far, we have to start the IO
> over again.
Sounds like all you need is cp --reflink=always and rsync --inplace.

Haven''t tested is that works well, though.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 10:33 UTC

head link

Re: Offline Deduplication for Btrfs

Chris Mason wrote:> Excerpts from Gordan Bobic''s message of 2011-01-05 12:42:42 -0500:
>> Josef Bacik wrote:
>>
>>> Basically I think online dedup is huge waste of time and completely
useless.
>> I couldn''t disagree more. First, let''s consider what
is the
>> general-purpose use-case of data deduplication. What are the resource 
>> requirements to perform it? How do these resource requirements differ 
>> between online and offline?
> 
> I don''t really agree with Josef that dedup is dumb, but I do think
his
> current approach is the most reasonable.  Dedup has a few very valid use
> cases, which I think break down to:
> 
> 1) backups
> 2) VM images.
> 
> The backup farm use case is the best candidate for dedup in general
> because they are generally write once and hopefully read never.
> Fragmentation for reading doesn''t matter at all and we''re
really very
> sure we''re going to backup the same files over and over again.
> 
> But, it''s also something that will be dramatically more efficient
when
> the backup server helps out.  The backup server knows two files have the
> same name, same size and can guess with very high accuracy that they
> will be the same.  So it is a very good candidate for Josef''s
offline
> dedup because it can just do the dedup right after writing the file.
File level deduplication in addition to block level would be great, no 
argument there. This can again be done more efficiently in-line, though, 
as the files come in.
> Next is the VM images.  This is actually a much better workload for
> online dedup, except for the part where our poor storage server would be
> spending massive amounts of CPU deduping blocks for all the VMs on the
> machine.  In this case the storage server doesn''t know the
> filenames, it just has bunches of blocks that are likely to be the same
> across VMs.
I''m still unconvinced that deduping''s major cost is CPU. I
think in any
real case it''ll be disk I/O.
> So, it seems a bit silly to do this out of band, where we wander through
> the FS and read a bunch of blocks in hopes of finding ones with the same
> hash.
Except you can get this almost for free. How about this approach:

1) Store a decent size hash for each block (checksum for ECC - something 
like this already happens, it''s just a question of what hashing 
algorithm to use)

2) Keep a (btree?) index of all known hashes (this doesn''t happen at
the
moment, AFAIK, so this would be the bulk of the new cost for dedup).

Now there are 2 options:

3a) Offline - go through the index, find the blocks with duplicate 
hashes, relink the pointers to one of them and free the rest. There is 
no need to actually read/write any data unless we are doing full block 
compare, only metadata needs to be updated. The problem with this is 
that you would still have to do a full index scan of the index to find 
all the duplicates, unless there is a second index specifically listing 
the duplicate blocks (maintained at insertion time).

3b) Online - look up whether the hash for the current block is already 
in the index ( O(log(n)) ), and if it is, don''t bother writing the data
block, only add a pointer to the existing block. No need for a second 
index with duplicate blocks in this case, either.
> But, one of the things on our features-to-implement page is to wander
> through the FS and read all the blocks from time to time.  We want to do
> this in the background to make sure the bits haven''t rotted on
disk.  By
> scrubbing from time to time we are able to make sure that when a disk
> does die, other disks in the FS are likely to have a good copy.
Scrubbing the whole FS seems like an overly expensive way to do things 
and it also requires low-load times (which don''t necessarily exist).
How
about scrubbing disk-by-disk, rather than the whole FS? If we keep 
checksums per block, then each disk can be checked for rot 
independently. It also means that if redundancy is used, the system 
doesn''t end up anywhere nearly as crippled during the scrubbing as the 
requests can be served from other disks that are a part of that FS (e.g. 
the mirrored pair in RAID1, or the parity blocks in higher level RAIDs).
> So again, Josef''s approach actually works very well.  His dedup
util
> could be the scrubbing util and we''ll get two projects for the
price of
> one.
Indeed, the scrubber would potentially give deduping functionality for 
free, but I''m not convinced that having deduping depend on scrubbing is
the way forward. This is where we get to multi-tier deduping again - 
perhaps have things markable for online or offline dedupe, as deemed 
more appropriate?
> As for the security of hashes, we''re unlikely to find a collision
on a
> sha256 that wasn''t made maliciously.  If the system''s
data is
> controlled and you''re not worried about evil people putting files
on
> there, extra reads really aren''t required.
If you manage to construct one maliciously that''s pretty bad in itself,
though. :)
> But then again, extra reads are a good thing (see above about
> scrubbing).
Only under very, very controlled conditions. This is supposed to be the 
"best" file system, not the slowest. :)
> The complexity of the whole operation goes down
> dramatically when we do the verifications because hash index corruptions
> (this extent has this hash) will be found instead of blindly trusted.
That is arguably an issue whatever you do, though. You have to trust 
that the data you get back off the disk is correct at least most of the 
time. Also, depending on what the extent of corruption is, you would 
potentially be able to spot it by having a hash out of order in the 
index (much cheaper, but says nothing about the integrity of the block 
itself).
> None of this means that online dedup is out of the question, I just
> think the offline stuff is a great way to start.

As I said, I''m not against offline dedupe for certain use cases,
I''m
merely saying that at a glance, overall, online dedup is more efficient.

Another point that was raised was about fragmentation. Thinking about 
it, there wouldn''t be any extra fragmentation where complete
files''
worth of blocks were deduped. You''d end up with the blocks still laid 
out sequentially on the disk (assuming we don''t deliberately pick a 
random instance of a block to link to but instead do something sensible).

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 10:39 UTC

head link

Re: Offline Deduplication for Btrfs

Spelic wrote:> On 01/06/2011 02:03 AM, Gordan Bobic wrote:
>>
>> That''s just alarmist. AES is being cryptanalyzed because
everything
>> uses it. And the news of it''s insecurity are somewhat
exaggerated (for
>> now at least). 
> 
> Who cares... the fact of not being much used is a benefit for RIPEMD / 
> blowfish-twofish then.
> Nobody makes viruses for Linux because they target windows. Same thing...
> RIPEMD has still an advantage over SHA imho, and blowfish over AES.
Just because nobody attacked it yet doesn''t justify complacency.
>>> If there is full blocks compare, a simpler/faster algorithm could
be
>>> chosen, like md5. Or even a md-64bits which I don''t think
it exists, but
>>> you can take MD4 and then xor the first 8 bytes with the second 8
bytes
>>> so to reduce it to 8 bytes only. This is just because it saves 60%
of
>>> the RAM occupation during dedup, which is expected to be large, and
the
>>> collisions are still insignificant at 64bits. Clearly you need to
do
>>> full blocks compare after that.
>>
>> I really don''t think the cost in terms of a few bytes per file
for the
>> hashes is that significant.
> 
> 20 to 8 = 12 bytes per *filesystem block* saved, I think
> Aren''t we talking about block-level deduplication?
> For every TB of filesystem you occupy 2GB of RAM with hashes instead of 
> 5.3GB (I am assuming 4K blocks, I don''t remember how big are btrfs
blocks)
> For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might 
> be the edge between feasible and not feasible.
> Actually it might not be feasible anyway... an option to store hashes 
> into a ssd should be provided then...
You wouldn''t necessarily have to keep the whole index in RAM, but if
you
don''t you''d get hit for an extra O(log(n)) disk seeks.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 10:48 UTC

head link

Re: Offline Deduplication for Btrfs

Peter A wrote:> On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
>>> I''d just make it always use the fs block size. No point in
making it
>>> variable.
>> Agreed. What is the reason for variable block size?
> 
> First post on this list - I mostly was just reading so far to learn more on
fs
> design but this is one topic I (unfortunately) have experience with... 
> 
> You wouldn''t believe the difference variable block size dedupe
makes. For a
> pure fileserver, its ok to dedupe on block level but for most other uses, 
> variable is king. One big example is backups. Netbackup and most others 
> produce one stream with all data even when backing up to disk. Imagine you 
> move a whole lot of data from one dir to another. Think a directory with
huge
> video files. As a filesystem it would be de-duped nicely. The backup stream
> however may and may not have matching fs blocks. If the directory name
before
> and after has the same lengths and such - then yeah, dedupe works.
Directory
> name is a byte shorter? Everything in the stream will be offset by one byte
-
> and no dedupe will occur at all on the whole dataset. In real world just 
> compare the dedupe performance of an Oracle 7000 (zfs and therefore fs
block
> based) to a DataDomain (variable lenght) in this usage scenario. Among our 
> customers we see something like 3 to 17x dedupe ration on the DD, 1.02 -
1.05
> in the 7000.
Can you elaborate what you''re talking about here? How does the length
of
a directory name affect alignment of file block contents? I don''t see 
how variability of length matters, other than to make things a lot more 
complicated. Have you some real research/scientifically gathered data 
(non-hearsay/non-anecdotal) on the underlying reasons for the 
discrepancy in the deduping effectiveness you describe? 3-17x difference 
doesn''t plausibly come purely from fixed vs. variable length block
sizes.

The only case where I''d bother consider variable length deduping is in 
file deduping (rather than block), in this case we can just make a COW 
hard-link and it''s _really_ cheap and effective.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 10:52 UTC

head link

Re: Offline Deduplication for Btrfs

Tomasz Chmielewski wrote:>> I have been thinking a lot about de-duplication for a backup
application
>> I am writing. I wrote a little script to figure out how much it would
>> save me. For my laptop home directory, about 100 GiB of data, it was a
>> couple of percent, depending a bit on the size of the chunks. With 4
KiB
>> chunks, I would save about two gigabytes. (That''s assuming no
MD5 hash
>> collisions.) I don''t have VM images, but I do have a fair bit
of saved
>> e-mail. So, for backups, I concluded it was worth it to provide an
>> option to do this. I have no opinion on whether it is worthwhile to do
>> in btrfs.
> 
> Online deduplication is very useful for backups of big, multi-gigabyte 
> files which change constantly.
> Some mail servers store files this way; some MUA store the files like 
> this; databases are also common to pack everything in big files which 
> tend to change here and there almost all the time.
> 
> Multi-gigabyte files which only have few megabytes changed can''t
be
> hardlinked; simple maths shows that even compressing multiple files 
> which have few differences will lead to greater space usage than a few 
> megabytes extra in each (because everything else is deduplicated).
> 
> And I don''t even want to think about IO needed to offline dedup a 
> multi-terabyte storage (1 TB disks and bigger are becoming standard 
> nowadays) i.e. daily, especially when the storage is already heavily 
> used in IO terms.
> 
> 
> Now, one popular tool which can deal with small changes in files is 
> rsync. It can be used to copy files over the network - so that if you 
> want to copy/update a multi-gigabyte file which only has a few changes, 
> rsync would need to transfer just a few megabytes.
> 
> On disk however, rsync creates a "temporary copy" of the original
file,
> where it packs unchanged contents together with any changes made. For 
> example, while it copies/updates a file, we will have:
> 
> original_file.bin
> .temporary_random_name
> 
> Later, original_file.bin would be removed, and .temporary_random_name 
> would be renamed to original_file.bin. Here goes away any deduplication 
> we had so far, we have to start the IO over again.
You can tell rsync to either modify the file in place (--inplace) or to 
put the temp file somewhere else (--temp-dir=DIR).

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Simon Farnsworth

2011-Jan-06 12:18 UTC

head link

Re: Offline Deduplication for Btrfs

Gordan Bobic wrote:
> Josef Bacik wrote:
> 
>> Basically I think online dedup is huge waste of time and completely
>> useless.
> 
> I couldn''t disagree more. First, let''s consider what is
the
> general-purpose use-case of data deduplication. What are the resource
> requirements to perform it? How do these resource requirements differ
> between online and offline?<snip>
> As an aside, zfs and lessfs both do online deduping, presumably for a
> good reason.
> 
> Then again, for a lot of use-cases there are perhaps better ways to
> achieve the targed goal than deduping on FS level, e.g. snapshotting or
> something like fl-cow:
> http://www.xmailserver.org/flcow.html
> Just a small point; Josef''s work provides a building block for a
userspace
notify-based online dedupe daemon.

The basic idea is to use fanotify/inotify (whichever of the notification 
systems works for this) to track which inodes have been written to. It can 
then mmap() the changed data (before it''s been dropped from RAM) and do
the
same process as an offline dedupe (hash, check for matches, call dedupe 
extent ioctl). If you''ve got enough CPU (maybe running with realtime
privs),
you should be able to do this before writes actually hit the disk.

Further, a userspace daemon can do more sophisticated online dedupe than is 
reasonable in the kernel - e.g. queue the dedupe extent ioctl phase for idle 
time, only dedupe inodes that have been left unwritten for x minutes, 
different policies for different bits of the filesystem (dedupe crontabs 
immediately on write, dedupe outgoing mail spool only when the mail sticks 
around for a while, dedupe all incoming mail immediately, dedupe logfiles 
after rotation only, whatever is appropriate).

It can also do more intelligent trickery than is reasonable in-kernel - e.g. 
if you know that you''re deduping e-mail (line-based), you can search
line-
by-line for dedupe blocks, rather than byte-by-byte.

Having said all that, you may well find that having implemented a userspace 
online dedupe daemon, there are things the kernel can do to help; you may 
even find that you do need to move it entirely into the kernel. Just
don''t
think that this ioctl rules out online dedupe - in fact, it enables it.

-- 
Simon Farnsworth

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 12:29 UTC

head link

Re: Offline Deduplication for Btrfs

Simon Farnsworth wrote:
> The basic idea is to use fanotify/inotify (whichever of the notification 
> systems works for this) to track which inodes have been written to. It can 
> then mmap() the changed data (before it''s been dropped from RAM)
and do the
> same process as an offline dedupe (hash, check for matches, call dedupe 
> extent ioctl). If you''ve got enough CPU (maybe running with
realtime privs),
> you should be able to do this before writes actually hit the disk.
I''m not convinced that racing against the disk write is the way forward
here.

As for having enough CPU to do this, a lot of modern CPUs (ARM, SPARC, 
Xeon) actually have hardware crypto acceleration/offload, so calculating 
checksums is fast and cheap.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Simon Farnsworth

2011-Jan-06 13:30 UTC

head link

Re: Offline Deduplication for Btrfs

Gordan Bobic wrote:
> Simon Farnsworth wrote:
> 
>> The basic idea is to use fanotify/inotify (whichever of the
notification
>> systems works for this) to track which inodes have been written to. It
>> can then mmap() the changed data (before it''s been dropped
from RAM) and
>> do the same process as an offline dedupe (hash, check for matches, call
>> dedupe extent ioctl). If you''ve got enough CPU (maybe running
with
>> realtime privs), you should be able to do this before writes actually
hit
>> the disk.
> 
> I''m not convinced that racing against the disk write is the way
forward
> here.
> The point is that implementing a userspace online dedupe daemon that races 
against the disk write is something that can be done by anyone who cares as 
soon as Josef''s patch is in place; if it''s clear that the
userspace daemon
just does something simple enough to put in the kernel (e.g. a fixed block 
size dedupe), and that extra complexity doesn''t gain enough to be 
worthwhile, the code can be ported into the kernel before it gets posted 
here.

Similarly, if you''re convinced that it has to be in kernel
(I''m not a dedupe
or filesystems expert, so there may be good reasons I''m unaware of),
you can
reuse parts of Josef''s code to write your patch that creates a kernel
thread
to do the work.

If it turns out that complex algorithms for online dedupe are worth the 
effort (like line-by-line e-mail dedupe), then you''ve got a starting
point
for writing something more complex, and determining just what the kernel 
needs to provide to make things nice - maybe it''ll be clear that you
need an
interface that lets you hold up a write while you do the simple end of the 
dedupe work, maybe there will be some other kernel interface that''s
more
generic than "dedupe fixed size blocks" that''s needed for
efficient work.

Either way, Josef''s work is a good starting point for online dedupe;
you can
experiment *now* without going into kernel code (heck, maybe even not C - 
Python or Perl would be OK for algorithm exploration), and use the offline 
dedupe support to simplify a patch for online dedupe.

-- 
Simon Farnsworth

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter A

2011-Jan-06 13:33 UTC

head link

Re: Offline Deduplication for Btrfs

On Thursday, January 06, 2011 05:48:18 am you wrote:> Can you elaborate what you''re talking about here? How does the
length of
> a directory name affect alignment of file block contents? I don''t
see
> how variability of length matters, other than to make things a lot more
> complicated.I''m saying in a filesystem it doesn''t matter - if you bundle
everything into a
backup stream, it does. Think of tar. 512 byte allignment. I tar up a 
directory with 8TB total size. No big deal. Now I create a new, empty file in 
this dir with a name that just happens to be the first in the dir. This adds 
512 bytes close to the beginning of the tar file the second time I run tar. Now 
the remainder of the is all offset by 512bytes and, if you do dedupe on fs-
block sized chunks larger than the 512bytes, not a single byte will be de-
duped. 
I know its a stupid example but it matches the backup-target and database dump 
usage pattern really well. Files backing iSCSI shows similar dedupe behavior. 
Essentially every time you bundle mutliple files together into one you run into 
things like that.
 > Have you some real research/scientifically gathered data
> (non-hearsay/non-anecdotal) on the underlying reasons for the
> discrepancy in the deduping effectiveness you describe? 3-17x difference
> doesn''t plausibly come purely from fixed vs. variable length block
sizes.Personal experience isn''t hearsay :) Netapp publishes a whitepaper
against the
7000 making this a big point but that isn''t publicly available. Try
search
"zfs dedupe +netbackup" or "zfs dedupe +datadomain" and
similar - you will
hear of hundreds of people all complain about the same thing.
> The only case where I''d bother consider variable length deduping
is in
> file deduping (rather than block), in this case we can just make a COW
> hard-link and it''s _really_ cheap and effective.I take your word for this :) 
> 
> Gordan
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 14:00 UTC

head link

Re: Offline Deduplication for Btrfs

Peter A wrote:> On Thursday, January 06, 2011 05:48:18 am you wrote:
>> Can you elaborate what you''re talking about here? How does the
length of
>> a directory name affect alignment of file block contents? I
don''t see
>> how variability of length matters, other than to make things a lot more
>> complicated.
 >> I''m saying in a filesystem it doesn''t matter - if you
bundle everything into a
> backup stream, it does. Think of tar. 512 byte allignment. I tar up a 
> directory with 8TB total size. No big deal. Now I create a new, empty file
in
> this dir with a name that just happens to be the first in the dir. This
adds
> 512 bytes close to the beginning of the tar file the second time I run tar.
Now
> the remainder of the is all offset by 512bytes and, if you do dedupe on fs-
> block sized chunks larger than the 512bytes, not a single byte will be de-
> duped. 
OK, I get what you mean now. And I don''t think this is something that 
should be solved in the file system.
> I know its a stupid example but it matches the backup-target and database
dump
> usage pattern really well. Files backing iSCSI shows similar dedupe
behavior.
> Essentially every time you bundle mutliple files together into one you run
into
> things like that.
I suspect a large part of the underlying cause of this is that most 
things still operate on 512 byte sectors. Once that is replaced with 4KB 
sectors, that will probably go away. And if this is the case, then 
perhaps making the block size "variable" to 512, 1024, 2048 and 4096 
bytes (probably in reverse order) will achieve most of that benefit.

Whether than is a worthwhile thing to do for poorly designed backup 
solutions, but I''m not convinced about the general use-case.
It''d be
very expensive and complicated for seemingly very limited benefit.
>> Have you some real research/scientifically gathered data
>> (non-hearsay/non-anecdotal) on the underlying reasons for the
>> discrepancy in the deduping effectiveness you describe? 3-17x
difference
>> doesn''t plausibly come purely from fixed vs. variable length
block sizes.
 >> Personal experience isn''t hearsay :) Netapp publishes a whitepaper
against the
> 7000 making this a big point but that isn''t publicly available.
Try search
> "zfs dedupe +netbackup" or "zfs dedupe +datadomain" and
similar - you will
> hear of hundreds of people all complain about the same thing.
Typical. And no doubt they complain that ZFS isn''t doing what they
want,
rather than netbackup not co-operating. The solution to one misdesign 
isn''t an expensive bodge. The solution to this particular problem is to
make netbackup work on per-file rather than per stream basis.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ondřej Bílka

2011-Jan-06 14:20 UTC

head link

Re: Offline Deduplication for Btrfs

On Thu, Jan 06, 2011 at 12:18:34PM +0000, Simon Farnsworth
wrote:> Gordan Bobic wrote:
> 
> > Josef Bacik wrote:
> > snip
 > > > Then again, for a lot of use-cases there are perhaps better ways to
> > achieve the targed goal than deduping on FS level, e.g. snapshotting
or
> > something like fl-cow:
> > http://www.xmailserver.org/flcow.html
> > As VM are concerned fl-cow is poor replacement of deduping.

Upgrading packages? 1st vm upgrades and copies changed files.
After while second upgrades and copies files too. More and more becomes duped
again.

If you host multiple distributions you need to translate
that /usr/share/bin/foo in foonux is /us/bin/bar in barux

And primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read file if machine B read it five
minutes ago.
 > --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 

user to computer ration too low.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tomasz Torcz

2011-Jan-06 14:30 UTC

head link

Re: Offline Deduplication for Btrfs

On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:> >CPU can handle considerably more than 250 block hashings per
> >second. You could argue that this changes in cases of sequential
> >I/O on big files, but a 1.86GHz GHz Core2 can churn through
> >111MB/s of SHA256, which even SSDs will struggle to keep up with.
> 
> A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
> A SSD can do more like 200MB/sec write 280MB/sec read sequential or
> random and is actually limited only by the SATA 3.0gbit/sec but soon
> enough they will have SATA/SAS 6.0gbit/sec.
  By “soon enough” you really meant “a year ago”, I think:
http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300
Current 6Gbps SSD are doing 415 MB/s sequential:
http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads
or even claim 550MB/s:
http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000
(funny bit: Sandforce SSD controllers dedup internally). 

  Anyway, 6Gbps is not a future tale, but something long available.
And not the fastest kids on the block:  currently build filesystems
must deal storage providing many gigabytes per second.  Think
of massive disk arrays or stuff like Oracle F5100, claiming
12.8GB/sec read and ~10GB/s write (in one rack unit).

-- 
Tomasz Torcz                 "God, root, what''s the
difference?"
xmpp: zdzichubg@chrome.pl         "God is more forgiving."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 14:41 UTC

head link

Re: Offline Deduplication for Btrfs

Ondřej Bílka wrote:
>>> Then again, for a lot of use-cases there are perhaps better ways to
>>> achieve the targed goal than deduping on FS level, e.g.
snapshotting or
>>> something like fl-cow:
>>> http://www.xmailserver.org/flcow.html
>>>
> As VM are concerned fl-cow is poor replacement of deduping.
Depends on your VM. If your VM uses monolithic images, then you''re 
right. For a better solution, take a look at vserver''s hashify feature 
for something that does this very well in it''s own context.
> Upgrading packages? 1st vm upgrades and copies changed files.
> After while second upgrades and copies files too. More and more becomes
duped again.
So you want online dedupe, then. :)
> If you host multiple distributions you need to translate
> that /usr/share/bin/foo in foonux is /us/bin/bar in barux
The chances of the binaries being the same between distros are between 
slim and none. In the context of VMs where you have access to raw files, 
as I said, look at vserver''s hashify feature. It doesn''t care
about file
names, it will COW hard-link all files with identical content. This 
doesn''t even require an exhaustive check of all the files''
contents -
you can start with file sizes. Files that have different sizes can''t 
have the same contents, so you can discard most of the comparing before 
you even open the file, most of the work gets done based on metadata alone.
> And primary reason to dedupe is not to reduce space usage but to
> improve caching. Why should machine A read file if machine B read it five
minutes ago.
Couldn''t agree more. This is what I was trying to explain earlier. Even
if deduping did cause more fragmentation (and I don''t think that is the
case to any significant extent), the improved caching efficiency would 
more than offset this.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 14:49 UTC

head link

Re: Offline Deduplication for Btrfs

Tomasz Torcz wrote:> On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:
>>> CPU can handle considerably more than 250 block hashings per
>>> second. You could argue that this changes in cases of sequential
>>> I/O on big files, but a 1.86GHz GHz Core2 can churn through
>>> 111MB/s of SHA256, which even SSDs will struggle to keep up with.
>> A normal 1TB disk with platters can do 130MB/sec sequential, no
problems.
>> A SSD can do more like 200MB/sec write 280MB/sec read sequential or
>> random and is actually limited only by the SATA 3.0gbit/sec but soon
>> enough they will have SATA/SAS 6.0gbit/sec.
> 
>   By “soon enough” you really meant “a year ago”, I think:
> http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300
> Current 6Gbps SSD are doing 415 MB/s sequential:
>
http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads
> or even claim 550MB/s:
>
http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000
> (funny bit: Sandforce SSD controllers dedup internally). 
> 
>   Anyway, 6Gbps is not a future tale, but something long available.
> And not the fastest kids on the block:  currently build filesystems
> must deal storage providing many gigabytes per second.  Think
> of massive disk arrays or stuff like Oracle F5100, claiming
> 12.8GB/sec read and ~10GB/s write (in one rack unit).
Sequential figures look nice and impressive but we all know they are 
meaningless for most real world workloads. IOPS are where it''s at. And 
maybe you can get 100,000 IOPS out of an SSD. But that still means 
100,000 SHA256 hashes/second. That''s 3.2MB/s of SHA256 hashes, or about
2% of what a modern x64 CPU will do, assuming it doesn''t have a
suitable
hardware crypto accelerator for that algorithm. So on a reasonably 
recent quad core CPU you would probably be able to comfortably handle 
about 200x that before it starts becoming an issue. If you''re that 
concerned about space requirements, doing LZO compression will still be 
much more expensive.

And that''s only for writes - on reads we don''t need to do any
hashing
(although it''s useful to do for the disk error checking reasons 
explained earlier).

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter A

2011-Jan-06 14:52 UTC

head link

Re: Offline Deduplication for Btrfs

On Thursday, January 06, 2011 09:00:47 am you wrote:> Peter A wrote:
> > I''m saying in a filesystem it doesn''t matter - if
you bundle everything
> > into a backup stream, it does. Think of tar. 512 byte allignment. I
tar
> > up a directory with 8TB total size. No big deal. Now I create a new,
> > empty file in this dir with a name that just happens to be the first
in
> > the dir. This adds 512 bytes close to the beginning of the tar file
the
> > second time I run tar. Now the remainder of the is all offset by
> > 512bytes and, if you do dedupe on fs- block sized chunks larger than
the
> > 512bytes, not a single byte will be de- duped.
> 
> OK, I get what you mean now. And I don''t think this is something
that
> should be solved in the file system.
<snip>> Whether than is a worthwhile thing to do for poorly designed backup
> solutions, but I''m not convinced about the general use-case.
It''d be
> very expensive and complicated for seemingly very limited benefit.Glad I finally explained myself properly... Unfortunately I disagree with you 
on the rest. If you take that logic, then I could claim dedupe is nothing a 
file system should handle - after all, its the user''s poorly designed 
applications that store multiple copies of data. Why should the fs take care 
of that? 

The problem doesn''t just affect backups. It affects everything where
you have
large data files that are not forced to allign with filesystem blocks. In 
addition to the case I mentioned above this affects in pretty much the same 
effectiveness:
* Database dumps 
* Video Editing 
* Files backing iSCSI volumes
* VM Images (fs blocks inside the VM rarely align with fs blocks in the 
backing storage). Our VM environment is backed with a 7410 and we get only 
about 10% dedupe. Copying the same images to a DataDomain results in a 60% 
reduction in space used.

Basically, every time I end up using a lot of storage space, its in a scenario 
where fs-block based dedupe is not very effective.

I also have to argue the point that these usages are "poorly
designed". Poorly
designed can only apply to technologies that existed or were talked about at 
the time the design was made. Tar and such have been around for a long time, 
way before anyone even though of dedupe. In addition, until there is a 
commonly accepted/standard API to query the block size so apps can generate 
files appropriately laid out for the backing filesystem, what is the application
supposed to do? 
If anything, I would actually argue the opposite, that fixed block dedupe is a 
poor design:
* The problem is known at the time the design was made
* No alternative can be offered as tar, netbackup, video editing, ... has been 
around for a long time and is unlikely to change in the near future
* There is no standard API to query the allignment parameters (and even that 
would not be great since copying a file alligned for 8k to a 16k alligned 
filesystem, would potentially cause the same issue again)

Also from the human perspective its hard to make end users understand your 
point of view. I promote the 7000 series of storage and I know how hard it is 
to explain the dedupe behavior there. They see that Datadomain does it, and 
does it well. So why can''t solution xyz do it just as good?
> Typical. And no doubt they complain that ZFS isn''t doing what they
want,
> rather than netbackup not co-operating. The solution to one misdesign
> isn''t an expensive bodge. The solution to this particular problem
is to
> make netbackup work on per-file rather than per stream basis.I''d agree if it was just limited to netbackup... I know variable block
length
is a significantly more difficult problem than block level. That''s why
the ZFS
team made the design choice they did. Variable length is also the reason why 
the DataDomain solution is a scale out rather than scalue up approach. 
However, CPUs get faster and faster - eventually they''ll be able to
handle it.
So the right solution (from my limited point of view, as I said, I''m
not a
filesystem design expert) would be to implement the data structures to handle 
variable length. Then in the first iteration, implement the dedupe algorithm to 
only search on filesystem blocks using existing checksums and such. Less CPU 
usage, quicker development, easier debugging. Once that is stable and proven, 
you can then without requiring the user to reformat, go ahead and implement 
variable length dedupe...

Btw, thanks for your time, Gordan :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gordan Bobic

2011-Jan-06 15:07 UTC

head link

Re: Offline Deduplication for Btrfs

Peter A wrote:> On Thursday, January 06, 2011 09:00:47 am you wrote:
>> Peter A wrote:
>>> I''m saying in a filesystem it doesn''t matter - if
you bundle everything
>>> into a backup stream, it does. Think of tar. 512 byte allignment. I
tar
>>> up a directory with 8TB total size. No big deal. Now I create a
new,
>>> empty file in this dir with a name that just happens to be the
first in
>>> the dir. This adds 512 bytes close to the beginning of the tar file
the
>>> second time I run tar. Now the remainder of the is all offset by
>>> 512bytes and, if you do dedupe on fs- block sized chunks larger
than the
>>> 512bytes, not a single byte will be de- duped.
>> OK, I get what you mean now. And I don''t think this is
something that
>> should be solved in the file system.
> <snip>
>> Whether than is a worthwhile thing to do for poorly designed backup
>> solutions, but I''m not convinced about the general use-case.
It''d be
>> very expensive and complicated for seemingly very limited benefit.
> Glad I finally explained myself properly... Unfortunately I disagree with
you
> on the rest. If you take that logic, then I could claim dedupe is nothing a
> file system should handle - after all, its the user''s poorly
designed
> applications that store multiple copies of data. Why should the fs take
care
> of that? 
There is merit in that point. Some applications do in fact do their own 
deduplication, as mentioned previously on this thread.
> The problem doesn''t just affect backups. It affects everything
where you have
> large data files that are not forced to allign with filesystem blocks. In 
> addition to the case I mentioned above this affects in pretty much the same
> effectiveness:
> * Database dumps 
> * Video Editing 
> * Files backing iSCSI volumes
> * VM Images (fs blocks inside the VM rarely align with fs blocks in the 
> backing storage). Our VM environment is backed with a 7410 and we get only 
> about 10% dedupe. Copying the same images to a DataDomain results in a 60% 
> reduction in space used.
I''d be interested to hear about the relative write performance on the 
variable block size.
> I also have to argue the point that these usages are "poorly
designed". Poorly
> designed can only apply to technologies that existed or were talked about
at
> the time the design was made. Tar and such have been around for a long
time,
> way before anyone even though of dedupe. In addition, until there is a 
> commonly accepted/standard API to query the block size so apps can generate
> files appropriately laid out for the backing filesystem, what is the
application
> supposed to do?
Indeed, this goes philosophically in the direction that storage vendors 
have been (unsuccessfully) been trying to shift the industry for decades 
- object based storage. But, sadly, it hasn''t happened (yet?).
> If anything, I would actually argue the opposite, that fixed block dedupe
is a
> poor design:
> * The problem is known at the time the design was made
> * No alternative can be offered as tar, netbackup, video editing, ... has
been
> around for a long time and is unlikely to change in the near future
> * There is no standard API to query the allignment parameters (and even
that
> would not be great since copying a file alligned for 8k to a 16k alligned 
> filesystem, would potentially cause the same issue again)
>
> Also from the human perspective its hard to make end users understand your 
> point of view. I promote the 7000 series of storage and I know how hard it
is
> to explain the dedupe behavior there. They see that Datadomain does it, and
> does it well. So why can''t solution xyz do it just as good?
I''d be interested to see the evidence of the "variable
length" argument.
I have a sneaky suspicion that it actually falls back to 512 byte 
blocks, which are much more likely to align, when more sensibly sized 
blocks fail. The downside is that you don''t really want to store a 32 
byte hash key with every 512 bytes of data, so you could peel off 512 
byte blocks off the front in a hope that a bigger block that follows 
will match.

Thinking about it, this might actually not be too expensive to do. If 
the 4KB block doesn''t match, check 512 byte sub-blocks, and try peeling
them, to make the next one line up. If it doesn''t, store the mismatch
as
a full 4KB block and resume. If you do find a match, save the peeled 512 
byte blocks separately and dedupe the 4KB block.

In fact, it''s rather like the loop peeling optimization on a compiler, 
that allows you to align the data to the boundary suitable for vectorizing.
>> Typical. And no doubt they complain that ZFS isn''t doing what
they want,
>> rather than netbackup not co-operating. The solution to one misdesign
>> isn''t an expensive bodge. The solution to this particular
problem is to
>> make netbackup work on per-file rather than per stream basis.
 >> I''d agree if it was just limited to netbackup... I know variable
block length
> is a significantly more difficult problem than block level. That''s
why the ZFS
> team made the design choice they did. Variable length is also the reason
why
> the DataDomain solution is a scale out rather than scalue up approach. 
> However, CPUs get faster and faster - eventually they''ll be able
to handle it.
> So the right solution (from my limited point of view, as I said,
I''m not a
> filesystem design expert) would be to implement the data structures to
handle
> variable length. Then in the first iteration, implement the dedupe
algorithm to
> only search on filesystem blocks using existing checksums and such. Less
CPU
> usage, quicker development, easier debugging. Once that is stable and
proven,
> you can then without requiring the user to reformat, go ahead and implement
> variable length dedupe...
Actually, see above - I believe I was wrong about how expensive 
"variable length" block size is likely to be. It''s more
expensive, sure,
but not orders of magnitude more expensive, and as discussed earlier, 
given the CPU isn''t really the key bottleneck here, I think
it''d be
quite workable.
> Btw, thanks for your time, Gordan :)
You''re welcome. :)

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ondřej Bílka

2011-Jan-06 15:37 UTC

head link

Re: Offline Deduplication for Btrfs

On Thu, Jan 06, 2011 at 02:41:28PM +0000, Gordan Bobic
wrote:> Ondřej Bílka wrote:
> 
> >>>Then again, for a lot of use-cases there are perhaps better
ways to
> >>>achieve the targed goal than deduping on FS level, e.g.
snapshotting or
> >>>something like fl-cow:
> >>>http://www.xmailserver.org/flcow.html
> >>>
> >As VM are concerned fl-cow is poor replacement of deduping.
> 
> Depends on your VM. If your VM uses monolithic images, then you''re
> right. For a better solution, take a look at vserver''s hashify
> feature for something that does this very well in it''s own
context.
> 
> >Upgrading packages? 1st vm upgrades and copies changed files.
> >After while second upgrades and copies files too. More and more becomes
duped again.
> 
> So you want online dedupe, then. :)
> 
> >If you host multiple distributions you need to translate
> >that /usr/share/bin/foo in foonux is /us/bin/bar in barux
> 
> The chances of the binaries being the same between distros are
> between slim and none. In the context of VMs where you have access
> to raw files, as I said, look at vserver''s hashify feature. It
> doesn''t care about file names, it will COW hard-link all files
with
> identical content. This doesn''t even require an exhaustive check
of
> all the files'' contents - you can start with file sizes. Files
that
> have different sizes can''t have the same contents, so you can
> discard most of the comparing before you even open the file, most of
> the work gets done based on metadata alone.
> Yes I wrote this as quick example. On second thought files shared 
between distros are typicaly write-only(like manpages)
>
> >And primary reason to dedupe is not to reduce space usage but to
> >improve caching. Why should machine A read file if machine B read it
five minutes ago.
> 
> Couldn''t agree more. This is what I was trying to explain earlier.
> Even if deduping did cause more fragmentation (and I don''t think
> that is the case to any significant extent), the improved caching
> efficiency would more than offset this.
> 
> Gordan
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 

Program load too heavy for processor to lift.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter A

2011-Jan-06 16:11 UTC

head link

Re: Offline Deduplication for Btrfs

On Thursday, January 06, 2011 10:07:03 am you wrote:> I''d be interested to see the evidence of the "variable
length" argument.
> I have a sneaky suspicion that it actually falls back to 512 byte
> blocks, which are much more likely to align, when more sensibly sized
> blocks fail. The downside is that you don''t really want to store a
32
> byte hash key with every 512 bytes of data, so you could peel off 512
> byte blocks off the front in a hope that a bigger block that follows
> will match.
> 
> Thinking about it, this might actually not be too expensive to do. If
> the 4KB block doesn''t match, check 512 byte sub-blocks, and try
peeling
> them, to make the next one line up. If it doesn''t, store the
mismatch as
> a full 4KB block and resume. If you do find a match, save the peeled 512
> byte blocks separately and dedupe the 4KB block.
> 
> In fact, it''s rather like the loop peeling optimization on a
compiler,
> that allows you to align the data to the boundary suitable for vectorizing.I''m not sure about this but to be honest I can not see any other way. 
Otherwise, how would you ever find a match? You can not store checksums of 
random sub-sections hoping that eventually stuff will match up... 512 byte is 
probably the best choice as its been a "known block size" since the
dawn of
unix.

> Actually, see above - I believe I was wrong about how expensive
> "variable length" block size is likely to be. It''s more
expensive, sure,
> but not orders of magnitude more expensive, and as discussed earlier,
> given the CPU isn''t really the key bottleneck here, I think
it''d be
> quite workable.Hmm - from my work with DD systems, it seems to be the CPU that ends up being 
the limiting factor on dedupe performance... But again, it seems that you have 
much deeper insight into this topic and have already drawn up an algorithm in 
your mind :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hubert Kario

2011-Jan-06 16:57 UTC

head link

Re: Offline Deduplication for Btrfs

On Thursday 06 of January 2011 10:51:04 Mike Hommey
wrote:> On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> > >I have been thinking a lot about de-duplication for a backup
application
> > >I am writing. I wrote a little script to figure out how much it
would
> > >save me. For my laptop home directory, about 100 GiB of data, it
was a
> > >couple of percent, depending a bit on the size of the chunks. With
4 KiB
> > >chunks, I would save about two gigabytes. (That''s
assuming no MD5 hash
> > >collisions.) I don''t have VM images, but I do have a fair
bit of saved
> > >e-mail. So, for backups, I concluded it was worth it to provide an
> > >option to do this. I have no opinion on whether it is worthwhile
to do
> > >in btrfs.
> > 
> > Online deduplication is very useful for backups of big,
> > multi-gigabyte files which change constantly.
> > Some mail servers store files this way; some MUA store the files
> > like this; databases are also common to pack everything in big files
> > which tend to change here and there almost all the time.
> > 
> > Multi-gigabyte files which only have few megabytes changed
can''t be
> > hardlinked; simple maths shows that even compressing multiple files
> > which have few differences will lead to greater space usage than a
> > few megabytes extra in each (because everything else is
> > deduplicated).
> > 
> > And I don''t even want to think about IO needed to offline
dedup a
> > multi-terabyte storage (1 TB disks and bigger are becoming standard
> > nowadays) i.e. daily, especially when the storage is already heavily
> > used in IO terms.
> > 
> > 
> > Now, one popular tool which can deal with small changes in files is
> > rsync. It can be used to copy files over the network - so that if
> > you want to copy/update a multi-gigabyte file which only has a few
> > changes, rsync would need to transfer just a few megabytes.
> > 
> > On disk however, rsync creates a "temporary copy" of the
original
> > file, where it packs unchanged contents together with any changes
> > made. For example, while it copies/updates a file, we will have:
> > 
> > original_file.bin
> > .temporary_random_name
> > 
> > Later, original_file.bin would be removed, and
> > .temporary_random_name would be renamed to original_file.bin. Here
> > goes away any deduplication we had so far, we have to start the IO
> > over again.
> 
> Sounds like all you need is cp --reflink=always and rsync --inplace.
> 
> Haven''t tested is that works well, though.
It works very well, btrfs with snapshots, compression and rsync --inplace has 
better storage utilisation than lessfs at around 10-15 snapshots with around 
600GB of test data in small files.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2011-Jan-06 18:35 UTC

head link

Re: Offline Deduplication for Btrfs

Excerpts from Peter A''s message of 2011-01-05 22:58:36
-0500:> On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
> > > I''d just make it always use the fs block size. No point
in making it
> > > variable.
> > 
> > Agreed. What is the reason for variable block size?
> 
> First post on this list - I mostly was just reading so far to learn more on
fs
> design but this is one topic I (unfortunately) have experience with... 
> 
> You wouldn''t believe the difference variable block size dedupe
makes. For a
> pure fileserver, its ok to dedupe on block level but for most other uses, 
> variable is king. One big example is backups. Netbackup and most others 
> produce one stream with all data even when backing up to disk. Imagine you 
> move a whole lot of data from one dir to another. Think a directory with
huge
> video files. As a filesystem it would be de-duped nicely. The backup stream
> however may and may not have matching fs blocks. If the directory name
before
> and after has the same lengths and such - then yeah, dedupe works.
Directory
> name is a byte shorter? Everything in the stream will be offset by one byte
-
> and no dedupe will occur at all on the whole dataset. In real world just 
> compare the dedupe performance of an Oracle 7000 (zfs and therefore fs
block
> based) to a DataDomain (variable lenght) in this usage scenario. Among our 
> customers we see something like 3 to 17x dedupe ration on the DD, 1.02 -
1.05
> in the 7000.
What is the smallest granularity that the datadomain searches for in
terms of dedup?

Josef''s current setup isn''t restricted to a specific block
size, but
there is a min match of 4k.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter A

2011-Jan-08 00:27 UTC

head link

Re: Offline Deduplication for Btrfs

On Thursday, January 06, 2011 01:35:15 pm Chris Mason
wrote:> What is the smallest granularity that the datadomain searches for in
> terms of dedup?
> 
> Josef''s current setup isn''t restricted to a specific
block size, but
> there is a min match of 4k.I talked to a few people I know and didn''t get a clear answer either...
However, 512 bytes came up more than once. 

I''m not really worried about the size of region to be used, but about 
offsetting it... its so easy to create large tars, ... where the content is 
offset by a few bytes, mutliples of 512 and such.

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ric Wheeler

2011-Jan-10 15:28 UTC

head link

Re: Offline Deduplication for Btrfs

I think that dedup has a variety of use cases that are all very dependent on 
your workload. The approach you have here seems to be a quite reasonable one.

I did not see it in the code, but it is great to be able to collect statistics 
on how effective your hash is and any counters for the extra IO imposed.

Also very useful to have a paranoid mode where when you see a hash collision 
(dedup candidate), you fall back to a byte-by-byte compare to verify that the 
the collision is correct.  Keeping stats on how often this is a false collision 
would be quite interesting as well :)

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-10 15:37 UTC

head link

Re: Offline Deduplication for Btrfs

On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler
wrote:>
> I think that dedup has a variety of use cases that are all very dependent 
> on your workload. The approach you have here seems to be a quite 
> reasonable one.
>
> I did not see it in the code, but it is great to be able to collect 
> statistics on how effective your hash is and any counters for the extra 
> IO imposed.
>
So I have counters for how many extents are deduped and the overall file
savings, is that what you are talking about?
> Also very useful to have a paranoid mode where when you see a hash 
> collision (dedup candidate), you fall back to a byte-by-byte compare to 
> verify that the the collision is correct.  Keeping stats on how often 
> this is a false collision would be quite interesting as well :)
>
So I''ve always done a byte-by-byte compare, first in userspace but now
its in
kernel, because frankly I don''t trust hashing algorithms with my data. 
It would
be simple enough to keep statistics on how often the byte-by-byte compare comes
out wrong, but really this is to catch changes to the file, so I have a
suspicion that most of these statistics would be simply that the file changed,
not that the hash was a collision.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2011-Jan-10 15:39 UTC

head link

Re: Offline Deduplication for Btrfs

Excerpts from Josef Bacik''s message of 2011-01-10 10:37:31
-0500:> On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> >
> > I think that dedup has a variety of use cases that are all very
dependent
> > on your workload. The approach you have here seems to be a quite 
> > reasonable one.
> >
> > I did not see it in the code, but it is great to be able to collect 
> > statistics on how effective your hash is and any counters for the
extra
> > IO imposed.
> >
> 
> So I have counters for how many extents are deduped and the overall file
> savings, is that what you are talking about?
> 
> > Also very useful to have a paranoid mode where when you see a hash 
> > collision (dedup candidate), you fall back to a byte-by-byte compare
to
> > verify that the the collision is correct.  Keeping stats on how often 
> > this is a false collision would be quite interesting as well :)
> >
> 
> So I''ve always done a byte-by-byte compare, first in userspace but
now its in
> kernel, because frankly I don''t trust hashing algorithms with my
data.  It would
> be simple enough to keep statistics on how often the byte-by-byte compare
comes
> out wrong, but really this is to catch changes to the file, so I have a
> suspicion that most of these statistics would be simply that the file
changed,
> not that the hash was a collision.  Thanks,
At least in the kernel, if you''re comparing extents on disk that are
from a committed transaction.  The contents won''t change.  We could
read
into a private buffer instead of into the file''s address space to make
this more reliable/strict.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2011-Jan-10 15:43 UTC

head link

Re: Offline Deduplication for Btrfs

On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason
wrote:> Excerpts from Josef Bacik''s message of 2011-01-10 10:37:31 -0500:
> > On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> > >
> > > I think that dedup has a variety of use cases that are all very
dependent
> > > on your workload. The approach you have here seems to be a quite 
> > > reasonable one.
> > >
> > > I did not see it in the code, but it is great to be able to
collect
> > > statistics on how effective your hash is and any counters for the
extra
> > > IO imposed.
> > >
> > 
> > So I have counters for how many extents are deduped and the overall
file
> > savings, is that what you are talking about?
> > 
> > > Also very useful to have a paranoid mode where when you see a
hash
> > > collision (dedup candidate), you fall back to a byte-by-byte
compare to
> > > verify that the the collision is correct.  Keeping stats on how
often
> > > this is a false collision would be quite interesting as well :)
> > >
> > 
> > So I''ve always done a byte-by-byte compare, first in
userspace but now its in
> > kernel, because frankly I don''t trust hashing algorithms with
my data.  It would
> > be simple enough to keep statistics on how often the byte-by-byte
compare comes
> > out wrong, but really this is to catch changes to the file, so I have
a
> > suspicion that most of these statistics would be simply that the file
changed,
> > not that the hash was a collision.  Thanks,
> 
> At least in the kernel, if you''re comparing extents on disk that
are
> from a committed transaction.  The contents won''t change.  We
could read
> into a private buffer instead of into the file''s address space to
make
> this more reliable/strict.
> 
Right sorry I was talking in the userspace case.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arjen Nienhuis

2011-Jan-16 00:18 UTC

head link

Re: Offline Deduplication for Btrfs

Hi,

I like your idea and implementation for offline deduplication a lot. I
think it will save me 50% of my backup storage!

Your code walks/scans the directory/file tree of the filesystem. Would
it be possible to walk/scan the disk extents sequentially in disk
order?

- This would be more I/O-efficient
- This would save you reading previously
deduped/snapshotted/hardlinked files more than once.
- Maybe this would make it possible to deduplicate directories as well.

Met vriendelijke groet,
Arjen Nienhuis

P.S. The NTFS implementation on Windows has ''ioctls'' to read
the MFT
sequentially in disk order and it''s *fast*. It''s being used
for things
like defrag.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Seemingly Similar Threads

Search for more apparently analagous threads

Btrfs devel - Jan 2011 - Offline Deduplication for Btrfs

Offline Deduplication for Btrfs

[PATCH] Btrfs: add extent-same ioctl for dedup

[PATCH] Btrfs-progs: add dedup functionality

Re: Offline Deduplication for Btrfs

Re: [PATCH] Btrfs: add extent-same ioctl for dedup

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Re: Offline Deduplication for Btrfs

Seemingly Similar Threads