Here are patches to do offline deduplication for Btrfs. It works well for the cases it''s expected to, I''m looking for feedback on the ioctl interface and such, I''m well aware there are missing features for the userspace app (like being able to set a different blocksize). If this interface is acceptable I will flesh out the userspace app a little more, but I believe the kernel side is ready to go. Basically I think online dedup is huge waste of time and completely useless. You are going to want to do different things with different data. For example, for a mailserver you are going to want to have very small blocksizes, but for say a virtualization image store you are going to want much larger blocksizes. And lets not get into heterogeneous environments, those just get much too complicated. So my solution is batched dedup, where a user just runs this command and it dedups everything at this point. This avoids the very costly overhead of having to hash and lookup for duplicate extents online and lets us be _much_ more flexible about what we want to deduplicate and how we want to do it. For the userspace app it only does 64k blocks, or whatever the largest area it can read out of a file. I''m going to extend this to do the following things in the near future 1) Take the blocksize as an argument so we can have bigger/smaller blocks 2) Have an option to _only_ honor the blocksize, don''t try and dedup smaller blocks 3) Use fiemap to try and dedup extents as a whole and just ignore specific blocksizes 4) Use fiemap to determine what would be the most optimal blocksize for the data you want to dedup. I''ve tested this out on my setup and it seems to work well. I appreciate any feedback you may have. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This adds the ability for userspace to tell btrfs which extents match eachother.
You pass in
-a logical offset
-a length
-a hash type (currently only sha256 is supported)
-the hash
-a list of file descriptors with their logical offset
and this ioctl will split up the extent on the target file and then link all of
the files with the target files extent and free up their original extent.  The
hash is to make sure nothing has changed between the userspace app running and
we doing the actual linking, so we hash everything to make sure it''s
all still
the same.  This doesn''t work in a few key cases
1) Any data transformation whatsoever.  This includes compression or any
encryption that happens later on.  This is just to make sure we''re not
deduping
things that don''t turn out to be the same stuff on disk as it is
uncompressed/decrypted.
2) Across subvolumes.  This can be fixed later, but this is just to keep odd
problems from happening, like oh say trying to dedup things that are snapshots
of eachother already.  Nothing bad will happen, it''s just needless work
so just
don''t allow it for the time being.
3) If the target file''s data is split across extents.  We need one
extent to
point everybody at, so if the target file''s data spans different
extents we
won''t work.  In this case I return ERANGE so the userspace app can call
defrag
and then try again, but currently I don''t do that, so that will have to
be fixed
at some point.
I think thats all of the special cases.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/btrfs/ioctl.c |  662 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   19 ++
 2 files changed, 681 insertions(+), 0 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f1c9bb4..15e1c63 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -40,6 +40,8 @@
 #include <linux/xattr.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -2231,6 +2233,664 @@ static noinline long btrfs_ioctl_wait_sync(struct file
*file, void __user *argp)
 	return btrfs_wait_for_commit(root, transid);
 }
 
+static noinline int check_hash(struct inode *inode,
+			       struct btrfs_ioctl_same_args *args,
+			       u64 off, u64 len)
+{
+	struct hash_desc desc;
+	struct scatterlist sg;
+	struct page *page;
+	void *addr;
+	char *buffer;
+	char *digest = NULL;
+	char *hash = NULL;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	int digest_len;
+
+	buffer = kmalloc(len, GFP_NOFS);
+	if (!buffer)
+		return -ENOMEM;
+
+	switch (args->hash_type) {
+	case BTRFS_SAME_EXTENT_HASH_SHA256:
+		desc.tfm = crypto_alloc_hash("sha256", 0, CRYPTO_ALG_ASYNC);
+		break;
+	default:
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!desc.tfm) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	digest_len = crypto_hash_digestsize(desc.tfm);
+	if (digest_len != args->hash_len) {
+		ret = -EINVAL;
+		goto out_crypto;
+	}
+
+	hash = kmalloc(digest_len, GFP_NOFS);
+	if (!hash) {
+		ret = -ENOMEM;
+		goto out_crypto;
+	}
+
+	digest = kmalloc(digest_len, GFP_NOFS);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto out_crypto;
+	}
+
+	if (copy_from_user(hash, (char __user *)args->hash, digest_len)) {
+		ret = -EFAULT;
+		goto out_crypto;
+	}
+
+	desc.flags = 0;
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out_crypto;
+		}
+
+		if (!PageUptodate(page)) {
+			btrfs_readpage(NULL, page);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out_crypto;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+	sg_init_one(&sg, buffer, len);
+
+	if (crypto_hash_digest(&desc, &sg, sg.length, digest)) {
+		ret = -EINVAL;
+		goto out_crypto;
+	}
+
+	if (memcmp(digest, hash, digest_len)) {
+		printk(KERN_ERR "hash''s don''t match\n");
+		ret = -EINVAL;
+	}
+
+out_crypto:
+	kfree(digest);
+	kfree(hash);
+	crypto_free_hash(desc.tfm);
+out:
+	kfree(buffer);
+	printk(KERN_ERR "mark 4\n");
+	return ret;
+}
+
+static noinline int split_extent(struct btrfs_trans_handle *trans,
+				 struct inode *inode, u64 start, u64 end,
+				 u64 *bytenr, u64 *bytes, u64 *offset)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_end;
+	u64 extent_offset;
+	u64 search_start = start;
+	u64 disk_bytenr;
+	u64 disk_bytes;
+	int ret;
+	int recow;
+	int extent_type;
+
+	btrfs_drop_extent_cache(inode, start, end - 1, 0);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		recow = 0;
+		ret = btrfs_lookup_file_extent(trans, root, path,
+					       inode->i_ino, search_start, 1);
+		if (ret < 0)
+			break;
+		if (ret > 0 && path->slots[0] > 0) {
+			path->slots[0]--;
+		} else if (ret > 0 && path->slots[0] == 0) {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+
+next_slot:
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid > inode->i_ino ||
+		    key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
+			break;
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		if (extent_type != BTRFS_FILE_EXTENT_REG) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (btrfs_file_extent_compression(leaf, fi) ||
+		    btrfs_file_extent_encryption(leaf, fi) ||
+		    btrfs_file_extent_other_encoding(leaf, fi)) {
+			printk(KERN_ERR "cannot dedup transformed extents\n");
+			ret = -EINVAL;
+			break;
+		}
+
+		extent_start = key.offset;
+		extent_end = extent_start +
+			btrfs_file_extent_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		disk_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+
+		if (extent_end < search_start) {
+			path->slots[0]++;
+			goto next_slot;
+		}
+
+		search_start = max(key.offset, start);
+		if (recow) {
+			btrfs_release_path(root, path);
+			continue;
+		}
+
+		/*
+		 *     | - range to split - |
+		 *  | -------- extent -------- |
+		 */
+		if (start > key.offset && end < extent_end) {
+			struct btrfs_key new_key;
+
+			memcpy(&new_key, &key, sizeof(new_key));
+			new_key.offset = start;
+			ret = btrfs_duplicate_item(trans, root, path,
+						   &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi, start -
+							key.offset);
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += start - key.offset;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - start);
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+
+			btrfs_mark_buffer_dirty(leaf);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino,
+						   extent_offset);
+			if (ret) {
+				int err;
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+					extent_end - extent_start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+
+			new_key.offset = end;
+			ret = btrfs_duplicate_item(trans, root, path, &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							end - start);
+
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += end - start;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - end);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino, 0);
+			if (ret) {
+				int err;
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+						extent_end - start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+			btrfs_mark_buffer_dirty(leaf);
+			ret = 0;
+			break;
+		}
+
+		/*
+		 * | - range to split -|
+		 * | ------ extent ------|
+		 */
+		if (start == key.offset && end < extent_end) {
+			struct btrfs_key new_key;
+
+			memcpy(&new_key, &key, sizeof(new_key));
+			new_key.offset = end;
+			ret = btrfs_duplicate_item(trans, root, path,
+						   &new_key);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(root, path);
+				continue;
+			}
+			if (ret < 0)
+				break;
+
+			leaf = path->nodes[0];
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							end - start);
+
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			extent_offset += end - start;
+			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - end);
+			btrfs_mark_buffer_dirty(leaf);
+
+			ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+						   disk_bytes, 0,
+						   root->root_key.objectid,
+						   inode->i_ino,
+						   extent_offset);
+			if (ret) {
+				int err;
+
+				WARN_ON(1);
+				fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					struct btrfs_file_extent_item);
+				btrfs_set_file_extent_num_bytes(leaf, fi,
+						extent_end - start);
+				err = btrfs_del_items(trans, root, path,
+						      path->slots[0], 1);
+				WARN_ON(err);
+				break;
+			}
+			ret = 0;
+			break;
+		}
+
+		if (start == key.offset && end == extent_end) {
+			if (bytes)
+				*bytes = disk_bytes;
+			if (bytenr)
+				*bytenr = disk_bytenr;
+			if (offset)
+				*offset = extent_offset;
+			ret = 0;
+			break;
+		}
+
+		ret = -ERANGE;
+		break;
+	}
+
+	btrfs_free_path(path);
+	return ret;
+}
+
+static noinline int link_extent(struct btrfs_trans_handle *trans,
+				struct inode *inode, u64 disk_bytenr,
+				u64 disk_bytes, u64 extent_offset,
+				u64 offset, u64 len)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_key key;
+	u64 hint;
+	int ret;
+	int err;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	ret = btrfs_inc_extent_ref(trans, root, disk_bytenr, disk_bytes, 0,
+				   root->root_key.objectid, inode->i_ino,
+				   extent_offset);
+	if (ret)
+		return ret;
+
+	ret = btrfs_drop_extents(trans, inode, offset, offset + len,
+				 &hint, 1);
+	if (ret) {
+		err = ret;
+		btrfs_free_path(path);
+		goto out;
+	}
+
+	key.objectid = inode->i_ino;
+	key.offset = offset;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*fi));
+
+	/*
+	 * Yes I know, it sucks, but if we don''t do this we''ll lose
data, we
+	 * need to come up with a way to undo the drop_extents so that if this
+	 * part fails we can still at least have the old data.
+	 */
+	BUG_ON(ret);
+
+	leaf = path->nodes[0];
+	fi = btrfs_item_ptr(leaf, path->slots[0],
+			    struct btrfs_file_extent_item);
+
+	btrfs_set_file_extent_generation(leaf, fi, trans->transid);
+	btrfs_set_file_extent_type(leaf, fi, BTRFS_FILE_EXTENT_REG);
+	btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
+	btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_bytes);
+	btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+	btrfs_set_file_extent_num_bytes(leaf, fi, len);
+	btrfs_set_file_extent_ram_bytes(leaf, fi, len);
+	btrfs_set_file_extent_compression(leaf, fi, 0);
+	btrfs_set_file_extent_encryption(leaf, fi, 0);
+	btrfs_set_file_extent_other_encoding(leaf, fi, 0);
+
+	btrfs_mark_buffer_dirty(leaf);
+	inode_add_bytes(inode, len);
+	btrfs_free_path(path);
+
+	return 0;
+out:
+	ret = btrfs_free_extent(trans, root, disk_bytenr, disk_bytes, 0,
+				root->root_key.objectid, inode->i_ino,
+				extent_offset);
+	if (ret)
+		err = ret;
+	return err;
+}
+
+static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
+{
+	while (1) {
+		struct btrfs_ordered_extent *ordered;
+		lock_extent(&BTRFS_I(inode)->io_tree, off,
+			    off + len - 1, GFP_NOFS);
+		ordered = btrfs_lookup_first_ordered_extent(inode,
+							    off + len - 1);
+		if (!ordered &&
+		    !test_range_bit(&BTRFS_I(inode)->io_tree, off,
+				    off + len - 1, EXTENT_DELALLOC, 0, NULL))
+			break;
+		unlock_extent(&BTRFS_I(inode)->io_tree, off, off + len -1,
+			      GFP_NOFS);
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		btrfs_wait_ordered_range(inode, off, len);
+	}
+}
+
+static noinline long btrfs_ioctl_file_extent_same(struct file *file,
+						  void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct inode *src = file->f_dentry->d_inode;
+	struct btrfs_root *root;
+	struct file *dst_file;
+	struct inode *dst_inode;
+	struct btrfs_trans_handle *trans;
+	u64 off;
+	u64 len;
+	u64 bytenr;
+	u64 bytes;
+	u64 extent_offset;
+	int args_size = 0;
+	int drop_ref = 0;
+	int i;
+	int ret;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp)))
+		return -EFAULT;
+
+	if (tmp.hash_type != BTRFS_SAME_EXTENT_HASH_SHA256)
+		return -EINVAL;
+
+	args_size = sizeof(tmp) + (tmp.total_files *
+			sizeof(struct btrfs_ioctl_file_extent_info));
+
+	/* lets not let this get out of hand */
+	if (args_size > PAGE_CACHE_SIZE)
+		return -ENOMEM;
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args)
+		return -ENOMEM;
+
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		return -EFAULT;
+
+	/* Make sure args didn''t change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp))) {
+		kfree(args);
+		return -EINVAL;
+	}
+
+	if (S_ISDIR(src->i_mode)) {
+		kfree(args);
+		return -EISDIR;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+	root = BTRFS_I(src)->root;
+
+	/*
+	 * Since we have to hash the extent to make sure it''s still right, we
+	 * have to allocate a buffer to hold the data, so let''s not let that
+	 * buffer be larger than 1mb just to make sure nobody abuses this.
+	 */
+	if (len > 1 * 1024 * 1024)
+		return -ENOMEM;
+
+	lock_extent_range(src, off, len);
+
+	if (check_hash(src, args, off, len)) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		kfree(args);
+		return -EINVAL;
+	}
+
+	trans = btrfs_start_transaction(root, args->total_files + 1);
+	if (IS_ERR(trans)) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		kfree(args);
+		return PTR_ERR(trans);
+	}
+
+	ret = split_extent(trans, src, off, off + len, &bytenr, &bytes,
+			   &extent_offset);
+	if (ret) {
+		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+		btrfs_end_transaction(trans, root);
+		kfree(args);
+		return ret;
+	}
+
+	ret = btrfs_inc_extent_ref(trans, root, bytenr, bytes, 0,
+				   root->root_key.objectid, src->i_ino,
+				   extent_offset);
+	unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1, GFP_NOFS);
+	if (ret) {
+		btrfs_end_transaction(trans, root);
+		kfree(args);
+		return ret;
+	}
+
+	drop_ref = 1;
+	for (i = 0; i < args->total_files; i++) {
+		struct btrfs_ioctl_file_extent_info *info;
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+		dst_inode = dst_file->f_dentry->d_inode;
+		if (S_ISDIR(dst_inode->i_mode)) {
+			fput(dst_file);
+			printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+
+		if (BTRFS_I(dst_inode)->root != root) {
+			fput(dst_file);
+			printk(KERN_ERR "btrfs: cannot dedup across subvolumes"
+			      " %lld\n", info->fd);
+			ret = -EINVAL;
+			break;
+		}
+
+		off = info->logical_offset;
+		lock_extent_range(dst_inode, off, len);
+		if (check_hash(dst_inode, args, off, len)) {
+			printk(KERN_ERR "btrfs: hash for fd %lld does not "
+			       "match\n", info->fd);
+			unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+				      off + len - 1, GFP_NOFS);
+			fput(dst_file);
+			continue;
+		}
+
+		ret = link_extent(trans, dst_inode, bytenr, bytes,
+				  extent_offset, off, len);
+		if (ret) {
+			unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+				      off + len - 1, GFP_NOFS);
+			fput(dst_file);
+			break;
+		}
+
+		unlock_extent(&BTRFS_I(dst_inode)->io_tree, off, off + len - 1,
+			      GFP_NOFS);
+
+		fput(dst_file);
+
+		if (drop_ref) {
+			ret = btrfs_free_extent(trans, root, bytenr, bytes, 0,
+						root->root_key.objectid,
+						src->i_ino, extent_offset);
+			if (ret)
+				break;
+			drop_ref = 0;
+		}
+	}
+
+	if (drop_ref) {
+		btrfs_free_extent(trans, root, bytenr, bytes, 0,
+				  root->root_key.objectid, src->i_ino,
+				  extent_offset);
+	}
+
+	btrfs_end_transaction(trans, root);
+
+	kfree(args);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2287,6 +2947,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_start_sync(file, argp);
 	case BTRFS_IOC_WAIT_SYNC:
 		return btrfs_ioctl_wait_sync(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 17c99eb..ca9b034 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -145,6 +145,23 @@ struct btrfs_ioctl_space_args {
 	struct btrfs_ioctl_space_info spaces[0];
 };
 
+#define BTRFS_SAME_EXTENT_HASH_SHA256	1
+
+struct btrfs_ioctl_file_extent_info {
+	__s64 fd;
+	__u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;
+	__u64 length;
+	__u64 total_files;
+	u32 hash_len;
+	u8 hash_type;
+	char *hash;
+	struct btrfs_ioctl_file_extent_info info[0];
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -189,4 +206,6 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
 				   struct btrfs_ioctl_async_vol_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+					struct btrfs_ioctl_same_args)
 #endif
-- 
1.6.6.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
This program very basically does dedup.  It searches a directory recursively and
scans all of the files looking for 64k extents and hashing them to figure out if
there are any duplicates.  After that it calls the btrfs same extent ioctl for
all of the duplicates in order to dedup the space on disk.  There is alot more
that can be done with this, like changing the block size and so on, but for now
this will get us started.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
---
 Makefile |    8 +-
 dedup.c  |  658 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ioctl.h  |   19 ++
 3 files changed, 683 insertions(+), 2 deletions(-)
 create mode 100644 dedup.c
diff --git a/Makefile b/Makefile
index 525676e..b9a51c4 100644
--- a/Makefile
+++ b/Makefile
@@ -15,10 +15,11 @@ INSTALL= install
 prefix ?= /usr/local
 bindir = $(prefix)/bin
 LIBS=-luuid
+GCRYPT_LIBS=`libgcrypt-config --libs`
+GCRYPT_CFLAGS=`libgcrypt-config --cflags`
 
 progs = btrfsctl mkfs.btrfs btrfs-debug-tree btrfs-show btrfs-vol btrfsck \
-	btrfs \
-	btrfs-map-logical
+	btrfs btrfs-map-logical btrfs-dedup
 
 # make C=1 to enable sparse
 ifdef C
@@ -50,6 +51,9 @@ btrfs-vol: $(objects) btrfs-vol.o
 btrfs-show: $(objects) btrfs-show.o
 	gcc $(CFLAGS) -o btrfs-show btrfs-show.o $(objects) $(LDFLAGS) $(LIBS)
 
+btrfs-dedup: $(objects) dedup.o
+	gcc $(CFLAGS) -o btrfs-dedup dedup.c $(objects) $(LDFLAGS) $(GCRYPT_LIBS)
$(GCRYPT_CFLAGS) $(LIBS)
+
 btrfsck: $(objects) btrfsck.o
 	gcc $(CFLAGS) -o btrfsck btrfsck.o $(objects) $(LDFLAGS) $(LIBS)
 
diff --git a/dedup.c b/dedup.c
new file mode 100644
index 0000000..5385e28
--- /dev/null
+++ b/dedup.c
@@ -0,0 +1,658 @@
+/*
+ * Copyright (C) 2011 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#define _GNU_SOURCE 1
+#define X_OPEN_SOURCE 500
+
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <gcrypt.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include "rbtree.h"
+#include "ioctl.h"
+
+struct file_entry {
+	ino_t ino;
+	unsigned long pos;
+	unsigned long len;
+	unsigned long physical;
+	char *hash;
+	int entries;
+	int fd;
+	struct file_entry *next;
+	struct rb_node n;
+};
+
+struct file {
+	char *name;
+	ino_t ino;
+	struct rb_node n;
+};
+
+static unsigned int digest_len;
+static char *digest;
+static struct rb_root hash_tree;
+static struct rb_root file_tree;
+unsigned int buffer_len = 65536;
+
+static void print_hash(uint64_t *hash)
+{
+	int i;
+
+	for (i = 0; i < 8; i++) {
+		printf("%llx", (unsigned long long)hash[i]);
+	}
+}
+
+static char *hash_buffer(void *buffer, int len)
+{
+	gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buffer, len);
+
+	return strndup(digest, digest_len);
+}
+
+static struct rb_node *hash_tree_insert(struct file_entry *fe)
+{
+	struct rb_node **p = &hash_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct file_entry *entry;
+
+	while (*p) {
+		int cmp;
+
+		parent = *p;
+		entry = rb_entry(parent, struct file_entry, n);
+
+		cmp = memcmp(fe->hash, entry->hash, digest_len);
+
+		if (cmp < 0)
+			p = &(*p)->rb_left;
+		else if (cmp > 0)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node(&fe->n, parent, p);
+	rb_insert_color(&fe->n, &hash_tree);
+
+	return NULL;
+}
+
+#if 0
+static unsigned long logical_to_physical(int fd, unsigned long logical,
+					 unsigned long len)
+{
+	struct fiemap *map;
+	struct fiemap_extent *extent;
+	unsigned long physical = 0;
+	int size = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
+	int ret;
+
+	map = malloc(sizeof(struct fiemap) + sizeof(struct fiemap_extent));
+	if (!map)
+		return 0;
+
+	memset(map, 0, size);
+	map->fm_start = logical;
+	map->fm_length = len;
+	map->fm_flags = FIEMAP_FLAG_SYNC;
+	map->fm_extent_count = 1;
+	ret = ioctl(fd, FS_IOC_FIEMAP, map);
+	if (ret) {
+		fprintf(stderr, "failed to do fiemap %d\n", errno);
+		return 0;
+	}
+	extent = &map->fm_extents[0];
+
+	physical = extent->fe_physical;
+	if (extent->fe_logical < logical)
+		physical +=  (logical - extent->fe_logical);
+	free(map);
+
+	return physical;
+}
+#endif
+
+static struct rb_node *file_tree_insert(struct file *file)
+{
+	struct rb_node **p = &file_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct file *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct file, n);
+
+		if (file->ino < entry->ino)
+			p = &(*p)->rb_left;
+		else if (file->ino > entry->ino)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node(&file->n, parent, p);
+	rb_insert_color(&file->n, &file_tree);
+
+	return NULL;
+}
+
+static struct file *file_tree_search(ino_t ino)
+{
+	struct rb_node *node = file_tree.rb_node;
+	struct file *entry;
+
+	while (node) {
+		entry = rb_entry(node, struct file, n);
+		if (ino < entry->ino)
+			node = node->rb_left;
+		else if (ino > entry->ino)
+			node = node->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+static int hash_file(char *filename)
+{
+	char *buffer;
+	struct rb_node *node;
+	struct file *file;
+	int fd;
+	ssize_t bytes;
+	size_t pos = 0;
+	struct stat st;
+	ino_t ino;
+
+	buffer = malloc(buffer_len);
+	if (!buffer) {
+		fprintf(stderr, "failed to alloc buffer\n");
+		return 1;
+	}
+
+	file = malloc(sizeof(*file));
+	if (!file) {
+		free(buffer);
+		fprintf(stderr, "failed to alloc file thing\n");
+		return 1;
+	}
+
+	fd = open(filename, O_RDONLY);
+	if (fd < 0) {
+		free(buffer);
+		return 1;
+	}
+
+	if (fstat(fd, &st) < 0) {
+		fprintf(stderr, "failed to stat\n");
+		free(buffer);
+		return 1;
+	}
+
+	ino = st.st_ino;
+	file->ino = ino;
+
+	node = file_tree_insert(file);
+	if (node) {
+		printf("wtf?\n");
+		free(file);
+		free(buffer);
+		close(fd);
+		return 0;
+	}
+
+	file->name = strdup(filename);
+
+	while ((bytes = read(fd, buffer, buffer_len)) > 0) {
+		struct file_entry *entry = malloc(sizeof(struct file_entry));
+
+		if (!entry) {
+			fprintf(stderr, "failed to allocate entry\n");
+			bytes = -1;
+			break;
+		}
+
+		entry->ino = ino;
+		entry->pos = pos;
+		entry->len = bytes;
+		entry->hash = hash_buffer(buffer, bytes);
+		if (!entry->hash) {
+			fprintf(stderr, "failed to allocate hash\n");
+			bytes = -1;
+			break;
+		}
+		entry->next = NULL;
+		entry->entries = 0;
+		node = hash_tree_insert(entry);
+		if (node) {
+			struct file_entry *existing;
+
+			existing = rb_entry(node, struct file_entry, n);
+			free(entry->hash);
+			entry->hash = NULL;
+			entry->next = existing->next;
+			existing->next = entry;
+			existing->entries++;
+		} else {
+//			entry->physical = logical_to_physical(fd, pos, bytes);
+		}
+		pos += bytes;
+	}
+
+	close(fd);
+	free(buffer);
+
+	if (bytes < 0)
+		printf("Bytes negative, error %d\n", errno);
+	return (bytes < 0);
+}
+
+static void print_stats()
+{
+	struct rb_node *node;
+	int total_extents = 0;
+	int total_duplicates = 0;
+
+	for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		total_extents++;
+		if (entry->entries)
+			total_duplicates += entry->entries;
+	}
+
+	printf("Total extents hashed:\t%d\n", total_extents);
+	printf("Total duplicates:\t%d\n", total_duplicates);
+	printf("Total space saved:\t%d\n", total_duplicates * buffer_len);
+
+}
+
+static void free_hash_tree()
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&hash_tree))) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		rb_erase(node, &hash_tree);
+		do {
+			struct file_entry *temp;
+
+			if (entry->next == NULL)
+				break;
+			temp = entry->next;
+			free(entry->hash);
+			free(entry);
+			entry = temp;
+		} while (1);
+
+		free(entry->hash);
+		free(entry);
+	}
+}
+
+static void free_file_tree()
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&file_tree))) {
+		struct file *entry;
+		entry = rb_entry(node, struct file, n);
+		rb_erase(node, &file_tree);
+//		printf("freeing %s, ino %lu\n", entry->name, entry->ino);
+		free(entry->name);
+		free(entry);
+	}
+}
+
+static void verify_match(struct file_entry *fe)
+{
+	struct file_entry *orig = fe;
+	struct file *file;
+	char *buffer;
+	char *temp;
+	ssize_t bytes;
+	int fd;
+
+	buffer = malloc(buffer_len);
+	if (!buffer)
+		return;
+
+	temp = malloc(buffer_len);
+	if (!temp) {
+		free(buffer);
+		return;
+	}
+
+	file = file_tree_search(fe->ino);
+	if (!file) {
+		fprintf(stderr, "couldn''t find file, weird %lu\n",
fe->ino);
+		goto out;
+	}
+
+	fd = open(file->name, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "failed to open%s\n", file->name);
+		goto out;
+	}
+
+	bytes = pread(fd, buffer, fe->len, fe->pos);
+	close(fd);
+	if (bytes < fe->len) {
+		fprintf(stderr, "short read\n");
+		goto out;
+	}
+
+	while (fe->next != NULL) {
+		struct file_entry *prev = fe;
+
+		fe = fe->next;
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldn''t find file, weird\n");
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		fd = open(file->name, O_RDONLY);
+		if (fd < 0) {
+			fprintf(stderr, "couldn''t open %s\n", file->name);
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		bytes = pread(fd, temp, fe->len, fe->pos);
+		close(fd);
+		if (bytes < fe->len) {
+			fprintf(stderr, "short read\n");
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+
+		if (memcmp(buffer, temp, fe->len)) {
+			fprintf(stderr, "manual compare doesn''t match: %s\n",
+				file->name);
+			prev->next = fe->next;
+			free(fe->hash);
+			free(fe);
+			orig->entries--;
+			fe = prev;
+			continue;
+		}
+	}
+
+	if (!orig->entries) {
+		rb_erase(&orig->n, &hash_tree);
+		free(orig->hash);
+		free(orig);
+	}
+out:
+	free(buffer);
+	free(temp);
+}
+
+static void prune_entries()
+{
+	struct rb_node *node;
+
+	node = rb_first(&hash_tree);
+	while (node) {
+		struct file_entry *entry;
+		entry = rb_entry(node, struct file_entry, n);
+		if (entry->entries) {
+			node = rb_next(node);
+			verify_match(entry);
+			continue;
+		}
+
+		node = rb_next(node);
+		rb_erase(&entry->n, &hash_tree);
+		free(entry->hash);
+		free(entry);
+	}
+}
+
+static void print_hashes()
+{
+	struct rb_node *hash_node;
+
+	for (hash_node = rb_first(&hash_tree); hash_node;
+	     hash_node = rb_next(hash_node)) {
+		struct file_entry *fe;
+		struct file *file;
+
+		fe = rb_entry(hash_node, struct file_entry, n);
+
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+			break;
+		}
+
+		print_hash((uint64_t *)fe->hash);
+		printf("\n");
+		printf("\t%s: pos=%lu, phys=%lu, len=%lu\n", file->name,
+		       fe->pos, fe->physical, fe->len);
+
+		while (fe->next != NULL) {
+			fe = fe->next;
+			file = file_tree_search(fe->ino);
+			if (!file) {
+				fprintf(stderr, "couldnt find file, weird\n");
+				continue;
+			}
+
+			printf("\t%s: pos=%lu, len=%lu\n", file->name,
+			       fe->pos, fe->len);
+		}
+	}
+}
+
+
+static void do_dedup()
+{
+	struct rb_node *node;
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_file_extent_info *info;
+	size_t size;
+	int allocated_entries = 1;
+
+	size = sizeof(*args) + sizeof(*info);
+	args = malloc(size);
+	if (!args) {
+		fprintf(stderr, "error allocating ioctl arguments\n");
+		return;
+	}
+
+	memset(args, 0, size);
+
+	for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+		struct file_entry *fe, *orig;
+		struct file *file;
+		int fd;
+		int ret;
+
+		orig = fe = rb_entry(node, struct file_entry, n);
+
+		file = file_tree_search(fe->ino);
+		if (!file) {
+			fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+			break;
+		}
+
+		fd = open(file->name, O_WRONLY);
+		if (fd < 0) {
+			fprintf(stderr, "error opening file (%s) for dedup: %s\n",
+				file->name, strerror(errno));
+			continue;
+		}
+
+		if (fe->entries > allocated_entries) {
+			allocated_entries = fe->entries;
+			size = sizeof(*args) +
+				(sizeof(*info) * allocated_entries);
+			args = realloc(args, size);
+			if (!args) {
+				fprintf(stderr, "error allocating ioctl "
+					"arguments\n");
+				return;
+			}
+			memset(args, 0, size);
+		}
+
+		args->total_files = 0;
+		args->logical_offset = fe->pos;
+		args->length = fe->len;
+		args->hash_len = digest_len;
+		args->hash_type = BTRFS_SAME_EXTENT_HASH_SHA256;
+		args->hash = fe->hash;
+
+		info = &args->info[0];
+		while (fe->next != NULL) {
+			fe = fe->next;
+			file = file_tree_search(fe->ino);
+			if (!file) {
+				fprintf(stderr, "couldnt find file, weird\n");
+				continue;
+			}
+
+			fe->fd = info->fd = open(file->name, O_WRONLY);
+			if (info->fd < 0) {
+				fprintf(stderr, "error opening file (%s) for "
+					"dedup: %s\n", file->name,
+					strerror(errno));
+				continue;
+			}
+			info->logical_offset = fe->pos;
+			info++;
+			args->total_files++;
+		}
+
+		if (args->total_files == 0) {
+			close(fd);
+			continue;
+		}
+
+		ret = ioctl(fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
+		if (ret)
+			fprintf(stderr, "failed to do extent same %d\n",
+				errno);
+
+		fe = orig;
+		while (fe->next != NULL) {
+			fe = fe->next;
+			if (fe->fd >= 0)
+				close(fe->fd);
+		}
+		close(fd);
+	}
+}
+
+static void scan_dir(char *dirname)
+{
+	DIR *dir;
+	struct dirent *dirent;
+
+	dir = opendir(dirname);
+	if (!dir) {
+		fprintf(stderr, "failed to open dir %s: %d\n", dirname, errno);
+		return;
+	}
+
+	while (((dirent = readdir(dir)) != NULL)) {
+		char name[PATH_MAX];
+		if (dirent->d_type == DT_DIR) {
+			if (dirent->d_name[0] == ''.'')
+				continue;
+			snprintf(name, PATH_MAX, "%s/%s", dirname,
+				 dirent->d_name);
+			scan_dir(name);
+		} else if (dirent->d_type == DT_REG) {
+			snprintf(name, PATH_MAX, "%s/%s", dirname,
+				 dirent->d_name);
+			hash_file(name);
+		}
+	}
+
+	closedir(dir);
+}
+
+int main(int argc, char **argv)
+{
+	if (argc < 2) {
+		fprintf(stderr, "dedup <directory>\n");
+		exit(1);
+	}
+
+	if (!gcry_check_version(GCRYPT_VERSION)) {
+		fprintf(stderr, "libgcrypt version mismatch\n");
+		exit(1);
+	}
+
+	gcry_control(GCRYCTL_DISABLE_SECMEM, 0);
+	gcry_control(GCRYCTL_INITIALIZATION_FINISHED, 0);
+
+	digest_len = gcry_md_get_algo_dlen(GCRY_MD_SHA256);
+	digest = malloc(digest_len);
+	if (!digest) {
+		fprintf(stderr, "failed to alloc digest\n");
+		exit(1);
+	}
+
+	hash_tree = RB_ROOT;
+	file_tree = RB_ROOT;
+
+	scan_dir(argv[1]);
+
+	print_stats();
+
+	prune_entries();
+
+	print_hashes();
+
+	do_dedup();
+
+	free_hash_tree();
+	free_file_tree();
+	free(digest);
+
+	return 0;
+}
diff --git a/ioctl.h b/ioctl.h
index 4ace64f..646837b 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -138,6 +138,23 @@ struct btrfs_ioctl_disk_info_args {
 	__u64 devices[0];
 };
 
+#define BTRFS_SAME_EXTENT_HASH_SHA256	1
+
+struct btrfs_ioctl_file_extent_info {
+	__s64 fd;
+	__u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;
+	__u64 length;
+	__u64 total_files;
+	__u32 hash_len;
+	__u8 hash_type;
+	char *hash;
+	struct btrfs_ioctl_file_extent_info info[0];
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -177,4 +194,6 @@ struct btrfs_ioctl_disk_info_args {
 				    struct btrfs_ioctl_space_args)
 #define BTRFS_IOC_DISK_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \
 				  struct btrfs_ioctl_disk_info_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+					struct btrfs_ioctl_same_args)
 #endif
-- 
1.6.6.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> Basically I think online dedup is huge waste of time and completely useless.I couldn''t disagree more. First, let''s consider what is the general-purpose use-case of data deduplication. What are the resource requirements to perform it? How do these resource requirements differ between online and offline? The only sane way to keep track of hashes of existing blocks is using an index. Searches through an index containing evenly distributed data (such as hashes) is pretty fast (log(N)), and this has to be done regardless of whether the dedupe is online or offline. It also goes without saying that all the blocks being deduplicated need to be hashed, and the cost of this is also the same whether the block is hashes online or offline. Let''s look at the relative merits: 1a) Offline We have to copy the entire data set. This means we are using the full amount of disk writes that the data set size dictates. Do we do the hashing of current blocks at this point to create the indexes? Or do we defer it until some later time? Doing it at the point of writes is cheaper - we already have the data in RAM and we can calculate the hashes as we are writing each block. Performance implications of this are fairly analogous to the parity RAID RMW performance issue - to achieve decent performance you have to write the parity at the same time as the rest of the stripe, otherwise you have to read the part of the stripe you didn''t write, before calculating the checksum. So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. How is this in any way advantageous? 1b) Online As we are writing the data, we calculate the hashes for each block. (See 1a for argument of why I believe this is saner and cheaper than doing it offline.) Since we already have these hashes, we can do a look-up in the hash-index, and either write out the block as is (if that hash isn''t already in the index) or simply write the pointer to an existing suitable block (if it already exists). This saves us writing out that block - fewer writes to the disk, not to mention we don''t later have to re-read the block to dedupe it. So in this case, instead of write-read-relink of the offline scenario, we simply do relink on duplicate blocks. There is another reason to favour the online option due to it''s lower write stress - SSDs. Why hammer the SSD with totally unnecessary writes? The _only_ reason to defer deduping is that hashing costs CPU time. But the chances are that a modern CPU core can churn out MD5 and/or SHA256 hashes faster than a modern mechanical disk can keep up. A 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with. I don''t think that the realtime performance argument withstands scrutiny.> You are going to want to do different things with different data. For example, > for a mailserver you are going to want to have very small blocksizes, but for > say a virtualization image store you are going to want much larger blocksizes. > And lets not get into heterogeneous environments, those just get much too > complicated.In terms of deduplication, IMO it should really all be uniform, transparent, and block based. In terms of specifying which subtrees to dedupe, that should really be a per subdirectory hereditary attribute, kind of like compression was supposed to work with chattr +c in the past.> So my solution is batched dedup, where a user just runs this > command and it dedups everything at this point. This avoids the very costly > overhead of having to hash and lookup for duplicate extents online and lets us > be _much_ more flexible about what we want to deduplicate and how we want to do > it.I don''t see that it adds any flexibility compared to the hereditary deduping attribute. I also don''t see that it is any cheaper. It''s actually more expensive, according to the reasoning above. As an aside, zfs and lessfs both do online deduping, presumably for a good reason. Then again, for a lot of use-cases there are perhaps better ways to achieve the targed goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html Personally, I would still like to see a fl-cow like solution that actually preserves the inode numbers of duplicate files while providing COW functionality that breaks this unity (and inode number identity) upon writes, specifically because it saves page cache (only have to cache one copy) and in case of DLLs on chroot style virtualization (OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are all mapped into the same memory, thus yielding massive memory savings on machines with a lot of VMs. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> This adds the ability for userspace to tell btrfs which extents match > eachother. You pass in > > -a logical offset > -a length > -a hash type (currently only sha256 is supported) > -the hash > -a list of file descriptors with their logical offset > > and this ioctl will split up the extent on the target file and then link > all of > the files with the target files extent and free up their original extent. > The hash is to make sure nothing has changed between the userspace app > running and we doing the actual linking, so we hash everything to make > sure it''s all still > the same. This doesn''t work in a few key cases > > 1) Any data transformation whatsoever. This includes compression or any > encryption that happens later on. This is just to make sure we''re not > deduping things that don''t turn out to be the same stuff on disk as it is > uncompressed/decrypted. > > 2) Across subvolumes. This can be fixed later, but this is just to keep > odd problems from happening, like oh say trying to dedup things that are > snapshots > of eachother already. Nothing bad will happen, it''s just needless work so > just don''t allow it for the time being. > > 3) If the target file''s data is split across extents. We need one extent > to point everybody at, so if the target file''s data spans different > extents we > won''t work. In this case I return ERANGE so the userspace app can call > defrag and then try again, but currently I don''t do that, so that will > have to be fixed at some point. > > I think thats all of the special cases. Thanks, >I''m going to ask the stupid question: What happens if an attacker user can race against the dedupe process? In particular, consider the following hypothetical scenario: Attacker has discovered a hash collision for some important data that they can read but not write (e.g. /etc/passwd, /home/user/company-critical- data.ods). They copy the important data from its original location to somewhere they can write to on the same filesystem. Now for the evil bit; they wait, watching for the dedupe process to run. When it''s had time to verify hash and memcmp the data, but before it calls the ioctl, Attacker swaps the copy of the data under their control for the bad one with the hash collision. If I''ve understood the code correctly, if Attacker''s version of the data is the source from the perspective of the ioctl, the kernel will hash the data, determine that the hash matches, not cross-check the entire extent with memcmp or equivalent, and will then splice Attacker''s version of data into the original file. If the collision merely lets Attacker trash the original, that''s bad enough; if it lets them put other interesting content in place, it''s a really bad problem. -- Here''s hoping I simply missed something, Simon Farnsworth -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:> So by doing the hash indexing offline, the total amount of disk I/O > required effectively doubles, and the amount of CPU spent on doing the > hashing is in no way reduced.But there are people who might want to avoid temporally the extra cost of online dedup, and do it offline when the server load is smaller. In my opinion, both online and offline dedup have valid use cases, and the best choice is probably implement both. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: > > So by doing the hash indexing offline, the total amount of disk I/O > > required effectively doubles, and the amount of CPU spent on doing the > > hashing is in no way reduced. > > But there are people who might want to avoid temporally the extra cost > of online dedup, and do it offline when the server load is smaller. > > In my opinion, both online and offline dedup have valid use cases, and > the best choice is probably implement both.Question from an end-user. When we say "offline" deduplication, are we talking about post-process deduplication (a la what Data ONTAP does with their SIS implementation) during which the underlying file system data continues to be available, or a process that needs exclusive access ot the blocks to do its job? Ray -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 05, 2011 at 05:42:42PM +0000, Gordan Bobic wrote:> Josef Bacik wrote: > >> Basically I think online dedup is huge waste of time and completely useless. > > I couldn''t disagree more. First, let''s consider what is the > general-purpose use-case of data deduplication. What are the resource > requirements to perform it? How do these resource requirements differ > between online and offline? > > The only sane way to keep track of hashes of existing blocks is using an > index. Searches through an index containing evenly distributed data > (such as hashes) is pretty fast (log(N)), and this has to be done > regardless of whether the dedupe is online or offline. It also goes > without saying that all the blocks being deduplicated need to be hashed, > and the cost of this is also the same whether the block is hashes online > or offline. > > Let''s look at the relative merits: > > 1a) Offline > We have to copy the entire data set. This means we are using the full > amount of disk writes that the data set size dictates. Do we do the > hashing of current blocks at this point to create the indexes? Or do we > defer it until some later time? > > Doing it at the point of writes is cheaper - we already have the data in > RAM and we can calculate the hashes as we are writing each block. > Performance implications of this are fairly analogous to the parity RAID > RMW performance issue - to achieve decent performance you have to write > the parity at the same time as the rest of the stripe, otherwise you > have to read the part of the stripe you didn''t write, before calculating > the checksum. > > So by doing the hash indexing offline, the total amount of disk I/O > required effectively doubles, and the amount of CPU spent on doing the > hashing is in no way reduced. > > How is this in any way advantageous? > > 1b) Online > As we are writing the data, we calculate the hashes for each block. (See > 1a for argument of why I believe this is saner and cheaper than doing it > offline.) Since we already have these hashes, we can do a look-up in the > hash-index, and either write out the block as is (if that hash isn''t > already in the index) or simply write the pointer to an existing > suitable block (if it already exists). This saves us writing out that > block - fewer writes to the disk, not to mention we don''t later have to > re-read the block to dedupe it. > > So in this case, instead of write-read-relink of the offline scenario, > we simply do relink on duplicate blocks. > > There is another reason to favour the online option due to it''s lower > write stress - SSDs. Why hammer the SSD with totally unnecessary writes? > > The _only_ reason to defer deduping is that hashing costs CPU time. But > the chances are that a modern CPU core can churn out MD5 and/or SHA256 > hashes faster than a modern mechanical disk can keep up. A 15,000rpm > disk can theoretically handle 250 IOPS. A modern CPU can handle > considerably more than 250 block hashings per second. You could argue > that this changes in cases of sequential I/O on big files, but a 1.86GHz > GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will > struggle to keep up with. > > I don''t think that the realtime performance argument withstands scrutiny. > >> You are going to want to do different things with different data. For example, >> for a mailserver you are going to want to have very small blocksizes, but for >> say a virtualization image store you are going to want much larger blocksizes. >> And lets not get into heterogeneous environments, those just get much too >> complicated. > > In terms of deduplication, IMO it should really all be uniform, > transparent, and block based. In terms of specifying which subtrees to > dedupe, that should really be a per subdirectory hereditary attribute, > kind of like compression was supposed to work with chattr +c in the past. > >> So my solution is batched dedup, where a user just runs this >> command and it dedups everything at this point. This avoids the very costly >> overhead of having to hash and lookup for duplicate extents online and lets us >> be _much_ more flexible about what we want to deduplicate and how we want to do >> it. > > I don''t see that it adds any flexibility compared to the hereditary > deduping attribute. I also don''t see that it is any cheaper. It''s > actually more expensive, according to the reasoning above. > > As an aside, zfs and lessfs both do online deduping, presumably for a > good reason. > > Then again, for a lot of use-cases there are perhaps better ways to > achieve the targed goal than deduping on FS level, e.g. snapshotting or > something like fl-cow: > http://www.xmailserver.org/flcow.html > > Personally, I would still like to see a fl-cow like solution that > actually preserves the inode numbers of duplicate files while providing > COW functionality that breaks this unity (and inode number identity) > upon writes, specifically because it saves page cache (only have to > cache one copy) and in case of DLLs on chroot style virtualization > (OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are > all mapped into the same memory, thus yielding massive memory savings on > machines with a lot of VMs. >Blah blah blah, I''m not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. The only reason I did offline dedup was because I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I''d save 300 megabytes. I''ll say that again, with 107 gigabytes in use, I''d save 300 megabytes. So in the normal user case dedup would have been wholey useless to me. Dedup is only usefull if you _know_ you are going to have duplicate information, so the two major usecases that come to mind are 1) Mail server. You have small files, probably less than 4k (blocksize) that you are storing hundreds to thousands of. Using dedup would be good for this case, and you''d have to have a small dedup blocksize for it to be usefull. 2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are you are going to share data between them, but unlike with the mail server example, you are likely to find much larger chunks that are the same, so you''d want a larger dedup blocksize, say 64k. You want this because if you did just 4k you''d end up with a ridiculous amount of framentation and performance would go down the toilet, so you need a larger dedup blocksize to make for better performance. So you''d want an online implementation to give you a choice of dedup blocksize, which seems to me to be overly complicated. And then lets bring up the fact that you _have_ to manually compare any data you are going to dedup. I don''t care if you think you have the greatest hashing algorithm known to man, you are still going to have collisions somewhere at some point, so in order to make sure you don''t lose data, you have to manually memcmp the data. So if you are doing this online, that means reading back the copy you want to dedup in the write path so you can do the memcmp before you write. That is going to make your write performance _suck_. Do I think offline dedup is awesome? Hell no, but I got distracted doing it as a side project so I figured I''d finish it, and I did it in under 1400 lines. I dare you to do the same with an online implementation. Offline is simpler to implement and simpler to debug if something goes wrong, and has an overall easier to control impact on the system. So there is an entirely too long of a response for something I didn''t really want to get into. People are free to do an online implementation, and good luck to them, but as for me I think it''s stupid and won''t be taking it up anytime soon. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:> Blah blah blah, I''m not having an argument about which is better because I > simply do not care. I think dedup is silly to begin with, and online dedup even > sillier. The only reason I did offline dedup was because I was just toying > around with a simple userspace app to see exactly how much I would save if I did > dedup on my normal system, and with 107 gigabytes in use, I''d save 300 > megabytes. I''ll say that again, with 107 gigabytes in use, I''d save 300 > megabytes. So in the normal user case dedup would have been wholey useless to > me.I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I would save about two gigabytes. (That''s assuming no MD5 hash collisions.) I don''t have VM images, but I do have a fair bit of saved e-mail. So, for backups, I concluded it was worth it to provide an option to do this. I have no opinion on whether it is worthwhile to do in btrfs. (For my script, see find-duplicate-chunks in http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/". http://braawi.org/obnam/ is the home page of the backup app.) -- Blog/wiki/website hosting with ikiwiki (free for free software): http://www.branchable.com/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik <josef@redhat.com> wrote:> Dedup is only usefull if you _know_ you are going to have duplicate information, > so the two major usecases that come to mind are > > 1) Mail server. You have small files, probably less than 4k (blocksize) that > you are storing hundreds to thousands of. Using dedup would be good for this > case, and you''d have to have a small dedup blocksize for it to be usefull. > > 2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are > you are going to share data between them, but unlike with the mail server > example, you are likely to find much larger chunks that are the same, so you''d > want a larger dedup blocksize, say 64k. You want this because if you did just > 4k you''d end up with a ridiculous amount of framentation and performance would > go down the toilet, so you need a larger dedup blocksize to make for better > performance.You missed out on the most obvious, and useful, use case for dedupe: central backups server. Our current backup server does an rsync backup of 127 servers every night into a single ZFS pool. 90+ of those servers are identical Debian installs (school servers), 20-odd of those are identical FreeBSD installs (firewalls/routers), and the rest are mail/web/db servers (Debian, Ubuntu, RedHat, Windows). Just as a test, we copied a week of backups to a Linux box running ZFS-fuse with dedupe enabled, and had a combined dedupe/compress ration in the low double-digits (11 or 12x, something like that). Now we''re just waiting for ZFSv22+ to hit FreeBSD to enable dedupe on the backups server. For backups, and central storage for VMs, online dedupe is a massive win. Offline, maybe. Either way, dedupe is worthwhile. -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 05, 2011 at 07:58:13PM +0000, Lars Wirzenius wrote:> On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote: > > Blah blah blah, I''m not having an argument about which is better because I > > simply do not care. I think dedup is silly to begin with, and online dedup even > > sillier. The only reason I did offline dedup was because I was just toying > > around with a simple userspace app to see exactly how much I would save if I did > > dedup on my normal system, and with 107 gigabytes in use, I''d save 300 > > megabytes. I''ll say that again, with 107 gigabytes in use, I''d save 300 > > megabytes. So in the normal user case dedup would have been wholey useless to > > me. > > I have been thinking a lot about de-duplication for a backup application > I am writing. I wrote a little script to figure out how much it would > save me. For my laptop home directory, about 100 GiB of data, it was a > couple of percent, depending a bit on the size of the chunks. With 4 KiB > chunks, I would save about two gigabytes. (That''s assuming no MD5 hash > collisions.) I don''t have VM images, but I do have a fair bit of saved > e-mail. So, for backups, I concluded it was worth it to provide an > option to do this. I have no opinion on whether it is worthwhile to do > in btrfs. >Yeah for things where you are talking about sending it over the network or something like that every little bit helps. I think deduplication is far more interesting and usefull at an application level than at a filesystem level. For example with a mail server, there is a good chance that the files will be smaller than a blocksize and not be able to be deduped, but if the application that was storing them recognized that it had the same messages and just linked everything in its own stuff then that would be cool. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/05/2011 06:41 PM, Diego Calleja wrote:> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: >> So by doing the hash indexing offline, the total amount of disk I/O >> required effectively doubles, and the amount of CPU spent on doing the >> hashing is in no way reduced. > > But there are people who might want to avoid temporally the extra cost > of online dedup, and do it offline when the server load is smaller.The point is that the offline dedup is actually twice as expensive, and the hashing part is nowhere nearly expensive as disk I/O. Disk I/O is very limited today, compared to CPU time. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/05/2011 07:01 PM, Ray Van Dolson wrote:> On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: >> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: >>> So by doing the hash indexing offline, the total amount of disk I/O >>> required effectively doubles, and the amount of CPU spent on doing the >>> hashing is in no way reduced. >> >> But there are people who might want to avoid temporally the extra cost >> of online dedup, and do it offline when the server load is smaller. >> >> In my opinion, both online and offline dedup have valid use cases, and >> the best choice is probably implement both. > > Question from an end-user. When we say "offline" deduplication, are we > talking about post-process deduplication (a la what Data ONTAP does > with their SIS implementation) during which the underlying file system > data continues to be available, or a process that needs exclusive > access ot the blocks to do its job?I was assuming it was a regular cron-job that grinds away on the disks but doesn''t require downtime. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote:> On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: > > On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: > > > So by doing the hash indexing offline, the total amount of disk I/O > > > required effectively doubles, and the amount of CPU spent on doing the > > > hashing is in no way reduced. > > > > But there are people who might want to avoid temporally the extra cost > > of online dedup, and do it offline when the server load is smaller. > > > > In my opinion, both online and offline dedup have valid use cases, and > > the best choice is probably implement both. > > Question from an end-user. When we say "offline" deduplication, are we > talking about post-process deduplication (a la what Data ONTAP does > with their SIS implementation) during which the underlying file system > data continues to be available, or a process that needs exclusive > access ot the blocks to do its job? >Yeah its just a post-process thing, you run it when you care to run it and it doesn''t make anything unavailable. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik <josef@redhat.com> wrote:> Yeah for things where you are talking about sending it over the network or > something like that every little bit helps. I think deduplication is far more > interesting and usefull at an application level than at a filesystem level. For > example with a mail server, there is a good chance that the files will be > smaller than a blocksize and not be able to be deduped, but if the application > that was storing them recognized that it had the same messages and just linked > everything in its own stuff then that would be cool. Thanks,Cyrus IMAP and Zimbra (probably a lot of others) already do that, hard-linking identical message bodies. The e-mail server use-case is for dedupe is pretty much covered already. -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/05/2011 07:46 PM, Josef Bacik wrote:> Blah blah blah, I''m not having an argument about which is better because I > simply do not care. I think dedup is silly to begin with, and online dedup even > sillier.Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotiend still sounds like an argument over which is better.> The only reason I did offline dedup was because I was just toying > around with a simple userspace app to see exactly how much I would save if I did > dedup on my normal system, and with 107 gigabytes in use, I''d save 300 > megabytes. I''ll say that again, with 107 gigabytes in use, I''d save 300 > megabytes. So in the normal user case dedup would have been wholey useless to > me.Dedup isn''t for an average desktop user. Dedup is for backup storage and virtual images. I don''t remember anyone ever saying it is for the average desktop user. I am amazed you got that much saving even - I wouldn''t expect there to be any duplicate files on a normal system. Compression is a feature that the desktop users would benefit with, not deduplication.> Dedup is only usefull if you _know_ you are going to have duplicate information, > so the two major usecases that come to mind are > > 1) Mail server. You have small files, probably less than 4k (blocksize) that > you are storing hundreds to thousands of. Using dedup would be good for this > case, and you''d have to have a small dedup blocksize for it to be usefull.Explain to me why you think this would yield duplicate blocks. If your server is Maildir, headers will be in the mail files, and because all emails went to different users, they''d have different headers, and thus not be dedupable.> 2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are > you are going to share data between them, but unlike with the mail server > example, you are likely to find much larger chunks that are the same, so you''d > want a larger dedup blocksize, say 64k. You want this because if you did just > 4k you''d end up with a ridiculous amount of framentation and performance would > go down the toilet, so you need a larger dedup blocksize to make for better > performance.Fragmentation will cause you problems anyway, the argument in the UNIX world since year dot was that defragging doesn''t make a damn worth of difference when you have a hundred users hammering away on a machine that has to skip between all their collective files. If you have VM image files a-la vmware/xen/kvm, then using blocks of the same size as the guests is the only way that you are going to get sane deduplication performance. Otherwise the blocks won''t line up. If the dedupe block size is 4KB and guest fs block size is 4KB, that''s a reasonably clean case. The biggest win by far, however, would be when using chroot type guests, as I mentioned.> So you''d want an online implementation to give you a choice of dedup blocksize, > which seems to me to be overly complicated.I''d just make it always use the fs block size. No point in making it variable.> And then lets bring up the fact that you _have_ to manually compare any data you > are going to dedup. I don''t care if you think you have the greatest hashing > algorithm known to man, you are still going to have collisions somewhere at some > point, so in order to make sure you don''t lose data, you have to manually memcmp > the data. So if you are doing this online, that means reading back the copy you > want to dedup in the write path so you can do the memcmp before you write. That > is going to make your write performance _suck_.IIRC, this is configurable in ZFS so that you can switch off the physical block comparison. If you use SHA256, the probability of a collission (unless SHA is broken, in which case we have much bigger problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 Exabytes. That''s one trillion trillion (that''s double trillion) Exabytes. That is considerably more storage space than there is likely to be available on the planet for some time. And just for good measure, you could always up the hash to SHA512 or use two different hashes (e.g. a combination of SHA256 and MD5).> Do I think offline dedup is awesome? Hell no, but I got distracted doing it as > a side project so I figured I''d finish it, and I did it in under 1400 lines. I > dare you to do the same with an online implementation. Offline is simpler to > implement and simpler to debug if something goes wrong, and has an overall > easier to control impact on the system.It is also better done outside the FS if you''re not going to do it properly using FL-COW or fuse based lessfs. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On ke, 2011-01-05 at 19:58 +0000, Lars Wirzenius wrote:> (For my script, see find-duplicate-chunks in > http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the > current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/". > http://braawi.org/obnam/ is the home page of the backup app.)If I may add: it would perhaps be good to collect numbers on the amount of duplication (for various block sizes) there is on different kinds of systems: random laptops, file servers for small companies and large companies, mail servers, backup servers, VM servers, etc. Would anyone be interested in collecting such numbers? A script like mine would be a bit heavy to run, but not too much so, I bet. It would be good to have hard numbers as a basis of discussion rather than guesses and assumptions. Or perhaps someone''s already collected the data? -- Blog/wiki/website hosting with ikiwiki (free for free software): http://www.branchable.com/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Miércoles, 5 de Enero de 2011 21:25:41 Gordan Bobic escribió:> The point is that the offline dedup is actually twice as expensive, and > the hashing part is nowhere nearly expensive as disk I/O. Disk I/O is > very limited today, compared to CPU time.And my point is:> But there are people who might want to avoid temporally the extra cost > of online dedup, and do it offline when the server load is smaller.In fact, there are cases where online dedup is clearly much worse. For example, cases where people suffer duplication, but it takes a lot of time (several months) to hit it. With online dedup, you need to enable it all the time to get deduplication, and the useless resource waste offsets the other advantages. With offline dedup, you only deduplicate when the system really needs it. And I can also imagine some unrealistic but theorically valid cases, like for example an embedded device that for some weird reason needs deduplication but doesn''t want online dedup because it needs to save as much power as possible. But it can run an offline dedup when the batteries are charging. It''s clear to me that if you really want a perfect deduplication solution you need both systems. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/05/2011 09:14 PM, Diego Calleja wrote:> In fact, there are cases where online dedup is clearly much worse. For > example, cases where people suffer duplication, but it takes a lot of > time (several months) to hit it. With online dedup, you need to enable > it all the time to get deduplication, and the useless resource waste > offsets the other advantages. With offline dedup, you only deduplicate > when the system really needs it.My point is that on a file server you don''t need to worry about the CPU cost of deduplication because you''ll run out of I/O long before you run out of CPU.> And I can also imagine some unrealistic but theorically valid cases, > like for example an embedded device that for some weird reason needs > deduplication but doesn''t want online dedup because it needs to save > as much power as possible. But it can run an offline dedup when the > batteries are charging.That''s _very_ theoretical.> It''s clear to me that if you really want a perfect deduplication > solution you need both systems.I''m not against having both. :) Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/06/2011 12:22 AM, Spelic wrote:> On 01/05/2011 09:46 PM, Gordan Bobic wrote: >> On 01/05/2011 07:46 PM, Josef Bacik wrote: >> >> Offline dedup is more expensive - so why are you of the opinion that >> it is less silly? And comparison by silliness quotiend still sounds >> like an argument over which is better. >> > > If I can say my opinion, I wouldn''t want dedup to be enabled online for > the whole filesystem. > > Three reasons: > > 1- Virtual machine disk images should not get deduplicated imho, if you > care about performances, because fragmentation is more important in that > case.I disagree. You''ll gain much, much more from improved caching and reduced page cache usage than you''ll lose from fragmentation.> So offline dedup is preferable IMHO. Or at least online dedup should > happen only on configured paths.Definitely agree that it should be a per-directory option, rather than per mount.> 2- I don''t want performances to drop all the time. I would run dedup > periodically on less active hours, hence, offline. A rate limiter should > also be implemented so not to trash the drives too much. Also a stop and > continue should be implemented, so that dedup which couldn''t finish > within a certain time-frame (e.g. one night) can be made continue the > night after without restarting from the beginning.This is the point I was making - you end up paying double the cost in disk I/O and the same cost in CPU terms if you do it offline. And I am not convniced the overhead of calculating checksums is that great. There are already similar overheads in checksums being calculated to enable smart data recovery in case of silent disk corruption. Now that I mentioned, that, it''s an interesting point. Could these be unified? If we crank up the checksums on files a bit, to something suitably useful for deduping, it could make the deduping feature almost free. As for restarting deduping (e.g. you chattr -R a directory to specify it for deduping. Since the contents aren''t already deduped (the files'' entries aren''t in the hash index, it''d be obvious what still needs to be deduped and what doesn''t.> 3- Only some directories should be deduped, for performance reasons. You > can foresee where duplicate blocks can exist and where not. Backup > directories typically, or mailservers directories. The rest is probably > a waste of time.Indeed, see above. I think it should be a per file setting/attribute, hereditary from the parent directory.> Also, the OS is small even if identical on multiple virtual images, how > much is going to occupy anyway? Less than 5GB per disk image usually. > And that''s the only thing that would be deduped because data likely to > be different on each instance. How many VMs running you have? 20? That''s > at most 100GB saved one-time at the cost of a lot of fragmentation.That''s also 100GB fewer disk blocks in contention for page cache. If you''re hitting the disks, you''re already going to slow down by several orders of magnitude. Better to make the caching more effective.>>> So if you are doing this online, that means reading back the copy you >>> want to dedup in the write path so you can do the memcmp before you >>> write. That >>> is going to make your write performance _suck_. >> >> IIRC, this is configurable in ZFS so that you can switch off the >> physical block comparison. If you use SHA256, the probability of a >> collission (unless SHA is broken, in which case we have much bigger >> problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 >> Exabytes. That''s one trillion trillion (that''s double trillion) Exabytes. > > I like mathematics, but I don''t care this time. I would never enable > dedup without full blocks compare. I think most users and most companies > would do the same.I understand where you are coming from, but by that reasoning you could also argue that AES256 isn''t good enough to keep your data confidential. It is a virtual certainty that you will lose several times that much data through catastrophic disk+raid+backup failures than through finding a hash collission.> If there is full blocks compare, a simpler/faster algorithm could be > chosen, like md5. Or even a md-64bits which I don''t think it exists, but > you can take MD4 and then xor the first 8 bytes with the second 8 bytes > so to reduce it to 8 bytes only. This is just because it saves 60% of > the RAM occupation during dedup, which is expected to be large, and the > collisions are still insignificant at 64bits. Clearly you need to do > full blocks compare after that.I really don''t think the cost in terms of a few bytes per file for the hashes is that significant.> Note that deduplication IS a cryptographically sensitive matter because > if sha-1 is cracked, people can nuke (or maybe even alter, and with > this, hack privileges) other users'' files by providing blocks with the > same SHA and waiting for dedup to pass. > Same thing for AES btw, it is showing weaknesses: use blowfish or twofish. > SHA1 and AES are two wrong standards...That''s just alarmist. AES is being cryptanalyzed because everything uses it. And the news of it''s insecurity are somewhat exaggerated (for now at least).> Dedup without full blocks compare seems indeed suited for online dedup > (which I wouldn''t enable, now for one more reason) because with full > block compares performances would really suck. But please leave full > blocks compare for the offline dedup.Actually, even if you are doing full block compares, online would still be faster, because at least one copy will already be in page cache, ready to hash. Online you get checksum+read+write, offline you get read+checksum+read+write. You still end up 1/3 ahead in terms if IOPS required.> Also I could suggest a third type of deduplication, but this is > harder... it''s a file-level deduplication which works like xdelta, that > is, it is capable to recognize piece of identical data on two files, > which are not at the same offset and which are not even aligned at block > boundary. For this, a rolling hash like the one of rsync, or the xdelta > 3.0 algorithm could be used. For this to work I suppose Btrfs needs to > handle the padding of filesystem blocks... which I''m not sure it was > foreseen.I think you''ll find this is way too hard to do sensibly. You are almost doing a rzip pass over the whole file system. I don''t think it''s really workable.> Above in this thread you said: >> The _only_ reason to defer deduping is that hashing costs CPU time. >> But the chances are that a modern CPU core can churn out MD5 and/or >> SHA256 hashes faster than a modern mechanical disk can keep up. A >> 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can >> handle considerably more than 250 block hashings per second. You could >> argue that this changes in cases of sequential I/O on big files, but a >> 1.86GHz GHz Core2 can churn through 111MB/s of SHA256, which even SSDs >> will struggle to keep up with. > > A normal 1TB disk with platters can do 130MB/sec sequential, no problems. > A SSD can do more like 200MB/sec write 280MB/sec read sequential or > random and is actually limited only by the SATA 3.0gbit/sec but soon > enough they will have SATA/SAS 6.0gbit/sec.But if you are spewing that much sequential data all the time, your workload is highly unusual, not to mention that those SSDs won''t last a year. And if you are streaming live video or have a real-time data logging application that generates that much data, the chances are that yuo won''t have gained anything from deduping anyway. I don''t think it''s a valid use case, at least until you can come up with at least a remotely realistic scenario where you might get plausible benefit from deduping in terms of space savings that involves sequentially streaming data to disk at full speed.> More cores can be used for hashing but multicore implementation for > stuff that is not natively threaded (such as parallel and completely > separate queries to a DB) usually is very difficult to do well. E.g. it > was attempted recently on MD raid for parity computation by > knowledgeable people but it performed so much worse than single-core > that it was disabled.You''d need a very fat array for one core to be unable to keep up. According to my dmesg, RAID5 checksumming on the box on my desk tops out at 1.8GB/s, and RAID6 at 1.2GB/s. That''s a lot of disks'' worth of bandwidth to have in an array, and that''s assuming large, streaming writes that can be handled efficiently. In reality, on smaller writes you''ll find you are severely bottlenecked by disk seek times, even if you have carefully tuned your MD and file system parameters to perfection. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/05/2011 09:46 PM, Gordan Bobic wrote:> On 01/05/2011 07:46 PM, Josef Bacik wrote: > > Offline dedup is more expensive - so why are you of the opinion that > it is less silly? And comparison by silliness quotiend still sounds > like an argument over which is better. >If I can say my opinion, I wouldn''t want dedup to be enabled online for the whole filesystem. Three reasons: 1- Virtual machine disk images should not get deduplicated imho, if you care about performances, because fragmentation is more important in that case. So offline dedup is preferable IMHO. Or at least online dedup should happen only on configured paths. 2- I don''t want performances to drop all the time. I would run dedup periodically on less active hours, hence, offline. A rate limiter should also be implemented so not to trash the drives too much. Also a stop and continue should be implemented, so that dedup which couldn''t finish within a certain time-frame (e.g. one night) can be made continue the night after without restarting from the beginning. 3- Only some directories should be deduped, for performance reasons. You can foresee where duplicate blocks can exist and where not. Backup directories typically, or mailservers directories. The rest is probably a waste of time.> Dedup isn''t for an average desktop user. Dedup is for backup storage > and virtual imagesNot virtual images imho, for the reason above. Also, the OS is small even if identical on multiple virtual images, how much is going to occupy anyway? Less than 5GB per disk image usually. And that''s the only thing that would be deduped because data likely to be different on each instance. How many VMs running you have? 20? That''s at most 100GB saved one-time at the cost of a lot of fragmentation. Now if you backup those images periodically, that''s a place where I would run dedup.> I''d just make it always use the fs block size. No point in making it > variable.Agreed. What is the reason for variable block size?>> And then lets bring up the fact that you _have_ to manually compare >> any data you >> are going to dedup. I don''t care if you think you have the greatest >> hashing >> algorithm known to man, you are still going to have collisions >> somewhere at some >> point, so in order to make sure you don''t lose data, you have to >> manually memcmp >> the data.Totally agreed>> So if you are doing this online, that means reading back the copy you >> want to dedup in the write path so you can do the memcmp before you >> write. That >> is going to make your write performance _suck_. > > IIRC, this is configurable in ZFS so that you can switch off the > physical block comparison. If you use SHA256, the probability of a > collission (unless SHA is broken, in which case we have much bigger > problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 > Exabytes. That''s one trillion trillion (that''s double trillion) Exabytes.I like mathematics, but I don''t care this time. I would never enable dedup without full blocks compare. I think most users and most companies would do the same. If there is full blocks compare, a simpler/faster algorithm could be chosen, like md5. Or even a md-64bits which I don''t think it exists, but you can take MD4 and then xor the first 8 bytes with the second 8 bytes so to reduce it to 8 bytes only. This is just because it saves 60% of the RAM occupation during dedup, which is expected to be large, and the collisions are still insignificant at 64bits. Clearly you need to do full blocks compare after that. BTW if you want to allow (as an option) dedup without full blocks compare, SHA1 is not so good: sha-0 already had problems, now sha-1 has problems, I almost wouldn''t suggest it for cryptographically secure stuff foreseeing the future. Use ripemd160 or even better ripemd256 which is even faster according to http://www.cryptopp.com/benchmarks.html ripemds are much better algorithms than shas: they have no known weaknesses. Note that deduplication IS a cryptographically sensitive matter because if sha-1 is cracked, people can nuke (or maybe even alter, and with this, hack privileges) other users'' files by providing blocks with the same SHA and waiting for dedup to pass. Same thing for AES btw, it is showing weaknesses: use blowfish or twofish. SHA1 and AES are two wrong standards... Dedup without full blocks compare seems indeed suited for online dedup (which I wouldn''t enable, now for one more reason) because with full block compares performances would really suck. But please leave full blocks compare for the offline dedup. Also I could suggest a third type of deduplication, but this is harder... it''s a file-level deduplication which works like xdelta, that is, it is capable to recognize piece of identical data on two files, which are not at the same offset and which are not even aligned at block boundary. For this, a rolling hash like the one of rsync, or the xdelta 3.0 algorithm could be used. For this to work I suppose Btrfs needs to handle the padding of filesystem blocks... which I''m not sure it was foreseen. Above in this thread you said:> The _only_ reason to defer deduping is that hashing costs CPU time. > But the chances are that a modern CPU core can churn out MD5 and/or > SHA256 hashes faster than a modern mechanical disk can keep up. A > 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can > handle considerably more than 250 block hashings per second. You could > argue that this changes in cases of sequential I/O on big files, but a > 1.86GHz GHz Core2 can churn through 111MB/s of SHA256, which even SSDs > will struggle to keep up with.A normal 1TB disk with platters can do 130MB/sec sequential, no problems. A SSD can do more like 200MB/sec write 280MB/sec read sequential or random and is actually limited only by the SATA 3.0gbit/sec but soon enough they will have SATA/SAS 6.0gbit/sec. More cores can be used for hashing but multicore implementation for stuff that is not natively threaded (such as parallel and completely separate queries to a DB) usually is very difficult to do well. E.g. it was attempted recently on MD raid for parity computation by knowledgeable people but it performed so much worse than single-core that it was disabled. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Gordan Bobic''s message of 2011-01-05 12:42:42 -0500:> Josef Bacik wrote: > > > Basically I think online dedup is huge waste of time and completely useless. > > I couldn''t disagree more. First, let''s consider what is the > general-purpose use-case of data deduplication. What are the resource > requirements to perform it? How do these resource requirements differ > between online and offline?I don''t really agree with Josef that dedup is dumb, but I do think his current approach is the most reasonable. Dedup has a few very valid use cases, which I think break down to: 1) backups 2) VM images. The backup farm use case is the best candidate for dedup in general because they are generally write once and hopefully read never. Fragmentation for reading doesn''t matter at all and we''re really very sure we''re going to backup the same files over and over again. But, it''s also something that will be dramatically more efficient when the backup server helps out. The backup server knows two files have the same name, same size and can guess with very high accuracy that they will be the same. So it is a very good candidate for Josef''s offline dedup because it can just do the dedup right after writing the file. In the backup farm, whole files are very likely to be identical, which again is very easy to optimize with Josef''s approach. Next is the VM images. This is actually a much better workload for online dedup, except for the part where our poor storage server would be spending massive amounts of CPU deduping blocks for all the VMs on the machine. In this case the storage server doesn''t know the filenames, it just has bunches of blocks that are likely to be the same across VMs. So, it seems a bit silly to do this out of band, where we wander through the FS and read a bunch of blocks in hopes of finding ones with the same hash. But, one of the things on our features-to-implement page is to wander through the FS and read all the blocks from time to time. We want to do this in the background to make sure the bits haven''t rotted on disk. By scrubbing from time to time we are able to make sure that when a disk does die, other disks in the FS are likely to have a good copy. So again, Josef''s approach actually works very well. His dedup util could be the scrubbing util and we''ll get two projects for the price of one. As for the security of hashes, we''re unlikely to find a collision on a sha256 that wasn''t made maliciously. If the system''s data is controlled and you''re not worried about evil people putting files on there, extra reads really aren''t required. But then again, extra reads are a good thing (see above about scrubbing). The complexity of the whole operation goes down dramatically when we do the verifications because hash index corruptions (this extent has this hash) will be found instead of blindly trusted. None of this means that online dedup is out of the question, I just think the offline stuff is a great way to start. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/06/2011 02:03 AM, Gordan Bobic wrote:> > That''s just alarmist. AES is being cryptanalyzed because everything > uses it. And the news of it''s insecurity are somewhat exaggerated (for > now at least).Who cares... the fact of not being much used is a benefit for RIPEMD / blowfish-twofish then. Nobody makes viruses for Linux because they target windows. Same thing... RIPEMD has still an advantage over SHA imho, and blowfish over AES.> >> If there is full blocks compare, a simpler/faster algorithm could be >> chosen, like md5. Or even a md-64bits which I don''t think it exists, but >> you can take MD4 and then xor the first 8 bytes with the second 8 bytes >> so to reduce it to 8 bytes only. This is just because it saves 60% of >> the RAM occupation during dedup, which is expected to be large, and the >> collisions are still insignificant at 64bits. Clearly you need to do >> full blocks compare after that. > > I really don''t think the cost in terms of a few bytes per file for the > hashes is that significant.20 to 8 = 12 bytes per *filesystem block* saved, I think Aren''t we talking about block-level deduplication? For every TB of filesystem you occupy 2GB of RAM with hashes instead of 5.3GB (I am assuming 4K blocks, I don''t remember how big are btrfs blocks) For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might be the edge between feasible and not feasible. Actually it might not be feasible anyway... an option to store hashes into a ssd should be provided then... -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic <gordan@bobich.net> wrote:> On 01/06/2011 12:22 AM, Spelic wrote: > Definitely agree that it should be a per-directory option, rather than per > mount.JOOC, would the dedupe "table" be done per directory, per mount, per sub-volume, or per volume? The larger the pool of data to check against, the better your dedupe ratios will be. I''m not up-to-date on all the terminology that btrfs uses, and how it compares to ZFS (disks -> vdevs -> pool -> filesystem/volume), so the terms above may be incorrect. :) In the ZFS world, dedupe is done pool-wide in that any block in the pool is a candidate for dedupe, but the dedupe property can be enabled/disabled on a per-filesystem basis. Thus, only blocks in filesystems with the dedupe property enabled will be deduped. But blocks from any filesystem can be compared against.> This is the point I was making - you end up paying double the cost in disk > I/O and the same cost in CPU terms if you do it offline. And I am not > convniced the overhead of calculating checksums is that great. There are > already similar overheads in checksums being calculated to enable smart data > recovery in case of silent disk corruption. > > Now that I mentioned, that, it''s an interesting point. Could these be > unified? If we crank up the checksums on files a bit, to something suitably > useful for deduping, it could make the deduping feature almost free.This is what ZFS does. Every block in the pool has a checksum attached to it. Originally, the default algorithm was fletcher2, with fletcher4 and sha256 as alternates. When dedupe was enabled, the default was changed to fletcher4. Dedupe also came with the option to enable/disable a byte-for-byte verify when the hashes match. By switching the checksum algorithm for the pool to sha256 ahead of time, you can enable dedupe, and get the dedupe checksumming for free. :)>> Also, the OS is small even if identical on multiple virtual images, how >> much is going to occupy anyway? Less than 5GB per disk image usually. >> And that''s the only thing that would be deduped because data likely to >> be different on each instance. How many VMs running you have? 20? That''s >> at most 100GB saved one-time at the cost of a lot of fragmentation. > > That''s also 100GB fewer disk blocks in contention for page cache. If you''re > hitting the disks, you''re already going to slow down by several orders of > magnitude. Better to make the caching more effective.If you setup your VMs as diskless images, using NFS off a storage server running <whatever FS> using dedupe, you can get a lot more out of it than using disk image files (where you have all the block sizes and alignment to worry about). And the you can use all the fancy snapshotting, cloning, etc features of <whatever FS> as well. -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:> > I''d just make it always use the fs block size. No point in making it > > variable. > > Agreed. What is the reason for variable block size?First post on this list - I mostly was just reading so far to learn more on fs design but this is one topic I (unfortunately) have experience with... You wouldn''t believe the difference variable block size dedupe makes. For a pure fileserver, its ok to dedupe on block level but for most other uses, variable is king. One big example is backups. Netbackup and most others produce one stream with all data even when backing up to disk. Imagine you move a whole lot of data from one dir to another. Think a directory with huge video files. As a filesystem it would be de-duped nicely. The backup stream however may and may not have matching fs blocks. If the directory name before and after has the same lengths and such - then yeah, dedupe works. Directory name is a byte shorter? Everything in the stream will be offset by one byte - and no dedupe will occur at all on the whole dataset. In real world just compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block based) to a DataDomain (variable lenght) in this usage scenario. Among our customers we see something like 3 to 17x dedupe ration on the DD, 1.02 - 1.05 in the 7000. There are many other examples and in the end it comes down to if you want general purpose de-dupe (e.g. something useful when you serve iscsi luns, serve as backup target, ...) or if you only care about a pure file store, you''re probably going to be ok with fixed block lengths... Hope that helps, Peter. -- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik <josef@redhat.com> wrote:> Here are patches to do offline deduplication for Btrfs. It works well for the > cases it''s expected to, I''m looking for feedback on the ioctl interface and > such, I''m well aware there are missing features for the userspace app (like > being able to set a different blocksize). If this interface is acceptable I > will flesh out the userspace app a little more, but I believe the kernel side is > ready to go. > > Basically I think online dedup is huge waste of time and completely useless. > You are going to want to do different things with different data. For example, > for a mailserver you are going to want to have very small blocksizes, but for > say a virtualization image store you are going to want much larger blocksizes. > And lets not get into heterogeneous environments, those just get much too > complicated. So my solution is batched dedup, where a user just runs this > command and it dedups everything at this point. This avoids the very costly > overhead of having to hash and lookup for duplicate extents online and lets us > be _much_ more flexible about what we want to deduplicate and how we want to do > it. > > For the userspace app it only does 64k blocks, or whatever the largest area it > can read out of a file. I''m going to extend this to do the following things in > the near future > > 1) Take the blocksize as an argument so we can have bigger/smaller blocks > 2) Have an option to _only_ honor the blocksize, don''t try and dedup smaller > blocks > 3) Use fiemap to try and dedup extents as a whole and just ignore specific > blocksizes > 4) Use fiemap to determine what would be the most optimal blocksize for the data > you want to dedup. > > I''ve tested this out on my setup and it seems to work well. I appreciate any > feedback you may have. Thanks, >FYI: Using clone ioctl can do the same thing (except reading data and computing hash in user space). Yan, Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I have been thinking a lot about de-duplication for a backup application > I am writing. I wrote a little script to figure out how much it would > save me. For my laptop home directory, about 100 GiB of data, it was a > couple of percent, depending a bit on the size of the chunks. With 4 KiB > chunks, I would save about two gigabytes. (That''s assuming no MD5 hash > collisions.) I don''t have VM images, but I do have a fair bit of saved > e-mail. So, for backups, I concluded it was worth it to provide an > option to do this. I have no opinion on whether it is worthwhile to do > in btrfs.Online deduplication is very useful for backups of big, multi-gigabyte files which change constantly. Some mail servers store files this way; some MUA store the files like this; databases are also common to pack everything in big files which tend to change here and there almost all the time. Multi-gigabyte files which only have few megabytes changed can''t be hardlinked; simple maths shows that even compressing multiple files which have few differences will lead to greater space usage than a few megabytes extra in each (because everything else is deduplicated). And I don''t even want to think about IO needed to offline dedup a multi-terabyte storage (1 TB disks and bigger are becoming standard nowadays) i.e. daily, especially when the storage is already heavily used in IO terms. Now, one popular tool which can deal with small changes in files is rsync. It can be used to copy files over the network - so that if you want to copy/update a multi-gigabyte file which only has a few changes, rsync would need to transfer just a few megabytes. On disk however, rsync creates a "temporary copy" of the original file, where it packs unchanged contents together with any changes made. For example, while it copies/updates a file, we will have: original_file.bin .temporary_random_name Later, original_file.bin would be removed, and .temporary_random_name would be renamed to original_file.bin. Here goes away any deduplication we had so far, we have to start the IO over again. -- Tomasz Chmielewski http://wpkg.org -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:> >I have been thinking a lot about de-duplication for a backup application > >I am writing. I wrote a little script to figure out how much it would > >save me. For my laptop home directory, about 100 GiB of data, it was a > >couple of percent, depending a bit on the size of the chunks. With 4 KiB > >chunks, I would save about two gigabytes. (That''s assuming no MD5 hash > >collisions.) I don''t have VM images, but I do have a fair bit of saved > >e-mail. So, for backups, I concluded it was worth it to provide an > >option to do this. I have no opinion on whether it is worthwhile to do > >in btrfs. > > Online deduplication is very useful for backups of big, > multi-gigabyte files which change constantly. > Some mail servers store files this way; some MUA store the files > like this; databases are also common to pack everything in big files > which tend to change here and there almost all the time. > > Multi-gigabyte files which only have few megabytes changed can''t be > hardlinked; simple maths shows that even compressing multiple files > which have few differences will lead to greater space usage than a > few megabytes extra in each (because everything else is > deduplicated). > > And I don''t even want to think about IO needed to offline dedup a > multi-terabyte storage (1 TB disks and bigger are becoming standard > nowadays) i.e. daily, especially when the storage is already heavily > used in IO terms. > > > Now, one popular tool which can deal with small changes in files is > rsync. It can be used to copy files over the network - so that if > you want to copy/update a multi-gigabyte file which only has a few > changes, rsync would need to transfer just a few megabytes. > > On disk however, rsync creates a "temporary copy" of the original > file, where it packs unchanged contents together with any changes > made. For example, while it copies/updates a file, we will have: > > original_file.bin > .temporary_random_name > > Later, original_file.bin would be removed, and > .temporary_random_name would be renamed to original_file.bin. Here > goes away any deduplication we had so far, we have to start the IO > over again.Sounds like all you need is cp --reflink=always and rsync --inplace. Haven''t tested is that works well, though. Mike -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason wrote:> Excerpts from Gordan Bobic''s message of 2011-01-05 12:42:42 -0500: >> Josef Bacik wrote: >> >>> Basically I think online dedup is huge waste of time and completely useless. >> I couldn''t disagree more. First, let''s consider what is the >> general-purpose use-case of data deduplication. What are the resource >> requirements to perform it? How do these resource requirements differ >> between online and offline? > > I don''t really agree with Josef that dedup is dumb, but I do think his > current approach is the most reasonable. Dedup has a few very valid use > cases, which I think break down to: > > 1) backups > 2) VM images. > > The backup farm use case is the best candidate for dedup in general > because they are generally write once and hopefully read never. > Fragmentation for reading doesn''t matter at all and we''re really very > sure we''re going to backup the same files over and over again. > > But, it''s also something that will be dramatically more efficient when > the backup server helps out. The backup server knows two files have the > same name, same size and can guess with very high accuracy that they > will be the same. So it is a very good candidate for Josef''s offline > dedup because it can just do the dedup right after writing the file.File level deduplication in addition to block level would be great, no argument there. This can again be done more efficiently in-line, though, as the files come in.> Next is the VM images. This is actually a much better workload for > online dedup, except for the part where our poor storage server would be > spending massive amounts of CPU deduping blocks for all the VMs on the > machine. In this case the storage server doesn''t know the > filenames, it just has bunches of blocks that are likely to be the same > across VMs.I''m still unconvinced that deduping''s major cost is CPU. I think in any real case it''ll be disk I/O.> So, it seems a bit silly to do this out of band, where we wander through > the FS and read a bunch of blocks in hopes of finding ones with the same > hash.Except you can get this almost for free. How about this approach: 1) Store a decent size hash for each block (checksum for ECC - something like this already happens, it''s just a question of what hashing algorithm to use) 2) Keep a (btree?) index of all known hashes (this doesn''t happen at the moment, AFAIK, so this would be the bulk of the new cost for dedup). Now there are 2 options: 3a) Offline - go through the index, find the blocks with duplicate hashes, relink the pointers to one of them and free the rest. There is no need to actually read/write any data unless we are doing full block compare, only metadata needs to be updated. The problem with this is that you would still have to do a full index scan of the index to find all the duplicates, unless there is a second index specifically listing the duplicate blocks (maintained at insertion time). 3b) Online - look up whether the hash for the current block is already in the index ( O(log(n)) ), and if it is, don''t bother writing the data block, only add a pointer to the existing block. No need for a second index with duplicate blocks in this case, either.> But, one of the things on our features-to-implement page is to wander > through the FS and read all the blocks from time to time. We want to do > this in the background to make sure the bits haven''t rotted on disk. By > scrubbing from time to time we are able to make sure that when a disk > does die, other disks in the FS are likely to have a good copy.Scrubbing the whole FS seems like an overly expensive way to do things and it also requires low-load times (which don''t necessarily exist). How about scrubbing disk-by-disk, rather than the whole FS? If we keep checksums per block, then each disk can be checked for rot independently. It also means that if redundancy is used, the system doesn''t end up anywhere nearly as crippled during the scrubbing as the requests can be served from other disks that are a part of that FS (e.g. the mirrored pair in RAID1, or the parity blocks in higher level RAIDs).> So again, Josef''s approach actually works very well. His dedup util > could be the scrubbing util and we''ll get two projects for the price of > one.Indeed, the scrubber would potentially give deduping functionality for free, but I''m not convinced that having deduping depend on scrubbing is the way forward. This is where we get to multi-tier deduping again - perhaps have things markable for online or offline dedupe, as deemed more appropriate?> As for the security of hashes, we''re unlikely to find a collision on a > sha256 that wasn''t made maliciously. If the system''s data is > controlled and you''re not worried about evil people putting files on > there, extra reads really aren''t required.If you manage to construct one maliciously that''s pretty bad in itself, though. :)> But then again, extra reads are a good thing (see above about > scrubbing).Only under very, very controlled conditions. This is supposed to be the "best" file system, not the slowest. :)> The complexity of the whole operation goes down > dramatically when we do the verifications because hash index corruptions > (this extent has this hash) will be found instead of blindly trusted.That is arguably an issue whatever you do, though. You have to trust that the data you get back off the disk is correct at least most of the time. Also, depending on what the extent of corruption is, you would potentially be able to spot it by having a hash out of order in the index (much cheaper, but says nothing about the integrity of the block itself).> None of this means that online dedup is out of the question, I just > think the offline stuff is a great way to start.As I said, I''m not against offline dedupe for certain use cases, I''m merely saying that at a glance, overall, online dedup is more efficient. Another point that was raised was about fragmentation. Thinking about it, there wouldn''t be any extra fragmentation where complete files'' worth of blocks were deduped. You''d end up with the blocks still laid out sequentially on the disk (assuming we don''t deliberately pick a random instance of a block to link to but instead do something sensible). Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Spelic wrote:> On 01/06/2011 02:03 AM, Gordan Bobic wrote: >> >> That''s just alarmist. AES is being cryptanalyzed because everything >> uses it. And the news of it''s insecurity are somewhat exaggerated (for >> now at least). > > Who cares... the fact of not being much used is a benefit for RIPEMD / > blowfish-twofish then. > Nobody makes viruses for Linux because they target windows. Same thing... > RIPEMD has still an advantage over SHA imho, and blowfish over AES.Just because nobody attacked it yet doesn''t justify complacency.>>> If there is full blocks compare, a simpler/faster algorithm could be >>> chosen, like md5. Or even a md-64bits which I don''t think it exists, but >>> you can take MD4 and then xor the first 8 bytes with the second 8 bytes >>> so to reduce it to 8 bytes only. This is just because it saves 60% of >>> the RAM occupation during dedup, which is expected to be large, and the >>> collisions are still insignificant at 64bits. Clearly you need to do >>> full blocks compare after that. >> >> I really don''t think the cost in terms of a few bytes per file for the >> hashes is that significant. > > 20 to 8 = 12 bytes per *filesystem block* saved, I think > Aren''t we talking about block-level deduplication? > For every TB of filesystem you occupy 2GB of RAM with hashes instead of > 5.3GB (I am assuming 4K blocks, I don''t remember how big are btrfs blocks) > For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might > be the edge between feasible and not feasible. > Actually it might not be feasible anyway... an option to store hashes > into a ssd should be provided then...You wouldn''t necessarily have to keep the whole index in RAM, but if you don''t you''d get hit for an extra O(log(n)) disk seeks. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A wrote:> On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote: >>> I''d just make it always use the fs block size. No point in making it >>> variable. >> Agreed. What is the reason for variable block size? > > First post on this list - I mostly was just reading so far to learn more on fs > design but this is one topic I (unfortunately) have experience with... > > You wouldn''t believe the difference variable block size dedupe makes. For a > pure fileserver, its ok to dedupe on block level but for most other uses, > variable is king. One big example is backups. Netbackup and most others > produce one stream with all data even when backing up to disk. Imagine you > move a whole lot of data from one dir to another. Think a directory with huge > video files. As a filesystem it would be de-duped nicely. The backup stream > however may and may not have matching fs blocks. If the directory name before > and after has the same lengths and such - then yeah, dedupe works. Directory > name is a byte shorter? Everything in the stream will be offset by one byte - > and no dedupe will occur at all on the whole dataset. In real world just > compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block > based) to a DataDomain (variable lenght) in this usage scenario. Among our > customers we see something like 3 to 17x dedupe ration on the DD, 1.02 - 1.05 > in the 7000.Can you elaborate what you''re talking about here? How does the length of a directory name affect alignment of file block contents? I don''t see how variability of length matters, other than to make things a lot more complicated. Have you some real research/scientifically gathered data (non-hearsay/non-anecdotal) on the underlying reasons for the discrepancy in the deduping effectiveness you describe? 3-17x difference doesn''t plausibly come purely from fixed vs. variable length block sizes. The only case where I''d bother consider variable length deduping is in file deduping (rather than block), in this case we can just make a COW hard-link and it''s _really_ cheap and effective. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tomasz Chmielewski wrote:>> I have been thinking a lot about de-duplication for a backup application >> I am writing. I wrote a little script to figure out how much it would >> save me. For my laptop home directory, about 100 GiB of data, it was a >> couple of percent, depending a bit on the size of the chunks. With 4 KiB >> chunks, I would save about two gigabytes. (That''s assuming no MD5 hash >> collisions.) I don''t have VM images, but I do have a fair bit of saved >> e-mail. So, for backups, I concluded it was worth it to provide an >> option to do this. I have no opinion on whether it is worthwhile to do >> in btrfs. > > Online deduplication is very useful for backups of big, multi-gigabyte > files which change constantly. > Some mail servers store files this way; some MUA store the files like > this; databases are also common to pack everything in big files which > tend to change here and there almost all the time. > > Multi-gigabyte files which only have few megabytes changed can''t be > hardlinked; simple maths shows that even compressing multiple files > which have few differences will lead to greater space usage than a few > megabytes extra in each (because everything else is deduplicated). > > And I don''t even want to think about IO needed to offline dedup a > multi-terabyte storage (1 TB disks and bigger are becoming standard > nowadays) i.e. daily, especially when the storage is already heavily > used in IO terms. > > > Now, one popular tool which can deal with small changes in files is > rsync. It can be used to copy files over the network - so that if you > want to copy/update a multi-gigabyte file which only has a few changes, > rsync would need to transfer just a few megabytes. > > On disk however, rsync creates a "temporary copy" of the original file, > where it packs unchanged contents together with any changes made. For > example, while it copies/updates a file, we will have: > > original_file.bin > .temporary_random_name > > Later, original_file.bin would be removed, and .temporary_random_name > would be renamed to original_file.bin. Here goes away any deduplication > we had so far, we have to start the IO over again.You can tell rsync to either modify the file in place (--inplace) or to put the temp file somewhere else (--temp-dir=DIR). Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic wrote:> Josef Bacik wrote: > >> Basically I think online dedup is huge waste of time and completely >> useless. > > I couldn''t disagree more. First, let''s consider what is the > general-purpose use-case of data deduplication. What are the resource > requirements to perform it? How do these resource requirements differ > between online and offline?<snip>> As an aside, zfs and lessfs both do online deduping, presumably for a > good reason. > > Then again, for a lot of use-cases there are perhaps better ways to > achieve the targed goal than deduping on FS level, e.g. snapshotting or > something like fl-cow: > http://www.xmailserver.org/flcow.html >Just a small point; Josef''s work provides a building block for a userspace notify-based online dedupe daemon. The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. It can then mmap() the changed data (before it''s been dropped from RAM) and do the same process as an offline dedupe (hash, check for matches, call dedupe extent ioctl). If you''ve got enough CPU (maybe running with realtime privs), you should be able to do this before writes actually hit the disk. Further, a userspace daemon can do more sophisticated online dedupe than is reasonable in the kernel - e.g. queue the dedupe extent ioctl phase for idle time, only dedupe inodes that have been left unwritten for x minutes, different policies for different bits of the filesystem (dedupe crontabs immediately on write, dedupe outgoing mail spool only when the mail sticks around for a while, dedupe all incoming mail immediately, dedupe logfiles after rotation only, whatever is appropriate). It can also do more intelligent trickery than is reasonable in-kernel - e.g. if you know that you''re deduping e-mail (line-based), you can search line- by-line for dedupe blocks, rather than byte-by-byte. Having said all that, you may well find that having implemented a userspace online dedupe daemon, there are things the kernel can do to help; you may even find that you do need to move it entirely into the kernel. Just don''t think that this ioctl rules out online dedupe - in fact, it enables it. -- Simon Farnsworth -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Simon Farnsworth wrote:> The basic idea is to use fanotify/inotify (whichever of the notification > systems works for this) to track which inodes have been written to. It can > then mmap() the changed data (before it''s been dropped from RAM) and do the > same process as an offline dedupe (hash, check for matches, call dedupe > extent ioctl). If you''ve got enough CPU (maybe running with realtime privs), > you should be able to do this before writes actually hit the disk.I''m not convinced that racing against the disk write is the way forward here. As for having enough CPU to do this, a lot of modern CPUs (ARM, SPARC, Xeon) actually have hardware crypto acceleration/offload, so calculating checksums is fast and cheap. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic wrote:> Simon Farnsworth wrote: > >> The basic idea is to use fanotify/inotify (whichever of the notification >> systems works for this) to track which inodes have been written to. It >> can then mmap() the changed data (before it''s been dropped from RAM) and >> do the same process as an offline dedupe (hash, check for matches, call >> dedupe extent ioctl). If you''ve got enough CPU (maybe running with >> realtime privs), you should be able to do this before writes actually hit >> the disk. > > I''m not convinced that racing against the disk write is the way forward > here. >The point is that implementing a userspace online dedupe daemon that races against the disk write is something that can be done by anyone who cares as soon as Josef''s patch is in place; if it''s clear that the userspace daemon just does something simple enough to put in the kernel (e.g. a fixed block size dedupe), and that extra complexity doesn''t gain enough to be worthwhile, the code can be ported into the kernel before it gets posted here. Similarly, if you''re convinced that it has to be in kernel (I''m not a dedupe or filesystems expert, so there may be good reasons I''m unaware of), you can reuse parts of Josef''s code to write your patch that creates a kernel thread to do the work. If it turns out that complex algorithms for online dedupe are worth the effort (like line-by-line e-mail dedupe), then you''ve got a starting point for writing something more complex, and determining just what the kernel needs to provide to make things nice - maybe it''ll be clear that you need an interface that lets you hold up a write while you do the simple end of the dedupe work, maybe there will be some other kernel interface that''s more generic than "dedupe fixed size blocks" that''s needed for efficient work. Either way, Josef''s work is a good starting point for online dedupe; you can experiment *now* without going into kernel code (heck, maybe even not C - Python or Perl would be OK for algorithm exploration), and use the offline dedupe support to simplify a patch for online dedupe. -- Simon Farnsworth -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, January 06, 2011 05:48:18 am you wrote:> Can you elaborate what you''re talking about here? How does the length of > a directory name affect alignment of file block contents? I don''t see > how variability of length matters, other than to make things a lot more > complicated.I''m saying in a filesystem it doesn''t matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte allignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty file in this dir with a name that just happens to be the first in the dir. This adds 512 bytes close to the beginning of the tar file the second time I run tar. Now the remainder of the is all offset by 512bytes and, if you do dedupe on fs- block sized chunks larger than the 512bytes, not a single byte will be de- duped. I know its a stupid example but it matches the backup-target and database dump usage pattern really well. Files backing iSCSI shows similar dedupe behavior. Essentially every time you bundle mutliple files together into one you run into things like that.> Have you some real research/scientifically gathered data > (non-hearsay/non-anecdotal) on the underlying reasons for the > discrepancy in the deduping effectiveness you describe? 3-17x difference > doesn''t plausibly come purely from fixed vs. variable length block sizes.Personal experience isn''t hearsay :) Netapp publishes a whitepaper against the 7000 making this a big point but that isn''t publicly available. Try search "zfs dedupe +netbackup" or "zfs dedupe +datadomain" and similar - you will hear of hundreds of people all complain about the same thing.> The only case where I''d bother consider variable length deduping is in > file deduping (rather than block), in this case we can just make a COW > hard-link and it''s _really_ cheap and effective.I take your word for this :)> > Gordan > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A wrote:> On Thursday, January 06, 2011 05:48:18 am you wrote: >> Can you elaborate what you''re talking about here? How does the length of >> a directory name affect alignment of file block contents? I don''t see >> how variability of length matters, other than to make things a lot more >> complicated.>> I''m saying in a filesystem it doesn''t matter - if you bundle everything into a > backup stream, it does. Think of tar. 512 byte allignment. I tar up a > directory with 8TB total size. No big deal. Now I create a new, empty file in > this dir with a name that just happens to be the first in the dir. This adds > 512 bytes close to the beginning of the tar file the second time I run tar. Now > the remainder of the is all offset by 512bytes and, if you do dedupe on fs- > block sized chunks larger than the 512bytes, not a single byte will be de- > duped.OK, I get what you mean now. And I don''t think this is something that should be solved in the file system.> I know its a stupid example but it matches the backup-target and database dump > usage pattern really well. Files backing iSCSI shows similar dedupe behavior. > Essentially every time you bundle mutliple files together into one you run into > things like that.I suspect a large part of the underlying cause of this is that most things still operate on 512 byte sectors. Once that is replaced with 4KB sectors, that will probably go away. And if this is the case, then perhaps making the block size "variable" to 512, 1024, 2048 and 4096 bytes (probably in reverse order) will achieve most of that benefit. Whether than is a worthwhile thing to do for poorly designed backup solutions, but I''m not convinced about the general use-case. It''d be very expensive and complicated for seemingly very limited benefit.>> Have you some real research/scientifically gathered data >> (non-hearsay/non-anecdotal) on the underlying reasons for the >> discrepancy in the deduping effectiveness you describe? 3-17x difference >> doesn''t plausibly come purely from fixed vs. variable length block sizes.>> Personal experience isn''t hearsay :) Netapp publishes a whitepaper against the > 7000 making this a big point but that isn''t publicly available. Try search > "zfs dedupe +netbackup" or "zfs dedupe +datadomain" and similar - you will > hear of hundreds of people all complain about the same thing.Typical. And no doubt they complain that ZFS isn''t doing what they want, rather than netbackup not co-operating. The solution to one misdesign isn''t an expensive bodge. The solution to this particular problem is to make netbackup work on per-file rather than per stream basis. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 06, 2011 at 12:18:34PM +0000, Simon Farnsworth wrote:> Gordan Bobic wrote: > > > Josef Bacik wrote: > >snip >> > Then again, for a lot of use-cases there are perhaps better ways to > > achieve the targed goal than deduping on FS level, e.g. snapshotting or > > something like fl-cow: > > http://www.xmailserver.org/flcow.html > >As VM are concerned fl-cow is poor replacement of deduping. Upgrading packages? 1st vm upgrades and copies changed files. After while second upgrades and copies files too. More and more becomes duped again. If you host multiple distributions you need to translate that /usr/share/bin/foo in foonux is /us/bin/bar in barux And primary reason to dedupe is not to reduce space usage but to improve caching. Why should machine A read file if machine B read it five minutes ago.> -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- user to computer ration too low. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:> >CPU can handle considerably more than 250 block hashings per > >second. You could argue that this changes in cases of sequential > >I/O on big files, but a 1.86GHz GHz Core2 can churn through > >111MB/s of SHA256, which even SSDs will struggle to keep up with. > > A normal 1TB disk with platters can do 130MB/sec sequential, no problems. > A SSD can do more like 200MB/sec write 280MB/sec read sequential or > random and is actually limited only by the SATA 3.0gbit/sec but soon > enough they will have SATA/SAS 6.0gbit/sec.By “soon enough” you really meant “a year ago”, I think: http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300 Current 6Gbps SSD are doing 415 MB/s sequential: http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads or even claim 550MB/s: http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000 (funny bit: Sandforce SSD controllers dedup internally). Anyway, 6Gbps is not a future tale, but something long available. And not the fastest kids on the block: currently build filesystems must deal storage providing many gigabytes per second. Think of massive disk arrays or stuff like Oracle F5100, claiming 12.8GB/sec read and ~10GB/s write (in one rack unit). -- Tomasz Torcz "God, root, what''s the difference?" xmpp: zdzichubg@chrome.pl "God is more forgiving." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ondřej Bílka wrote:>>> Then again, for a lot of use-cases there are perhaps better ways to >>> achieve the targed goal than deduping on FS level, e.g. snapshotting or >>> something like fl-cow: >>> http://www.xmailserver.org/flcow.html >>> > As VM are concerned fl-cow is poor replacement of deduping.Depends on your VM. If your VM uses monolithic images, then you''re right. For a better solution, take a look at vserver''s hashify feature for something that does this very well in it''s own context.> Upgrading packages? 1st vm upgrades and copies changed files. > After while second upgrades and copies files too. More and more becomes duped again.So you want online dedupe, then. :)> If you host multiple distributions you need to translate > that /usr/share/bin/foo in foonux is /us/bin/bar in baruxThe chances of the binaries being the same between distros are between slim and none. In the context of VMs where you have access to raw files, as I said, look at vserver''s hashify feature. It doesn''t care about file names, it will COW hard-link all files with identical content. This doesn''t even require an exhaustive check of all the files'' contents - you can start with file sizes. Files that have different sizes can''t have the same contents, so you can discard most of the comparing before you even open the file, most of the work gets done based on metadata alone.> And primary reason to dedupe is not to reduce space usage but to > improve caching. Why should machine A read file if machine B read it five minutes ago.Couldn''t agree more. This is what I was trying to explain earlier. Even if deduping did cause more fragmentation (and I don''t think that is the case to any significant extent), the improved caching efficiency would more than offset this. Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tomasz Torcz wrote:> On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote: >>> CPU can handle considerably more than 250 block hashings per >>> second. You could argue that this changes in cases of sequential >>> I/O on big files, but a 1.86GHz GHz Core2 can churn through >>> 111MB/s of SHA256, which even SSDs will struggle to keep up with. >> A normal 1TB disk with platters can do 130MB/sec sequential, no problems. >> A SSD can do more like 200MB/sec write 280MB/sec read sequential or >> random and is actually limited only by the SATA 3.0gbit/sec but soon >> enough they will have SATA/SAS 6.0gbit/sec. > > By “soon enough” you really meant “a year ago”, I think: > http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300 > Current 6Gbps SSD are doing 415 MB/s sequential: > http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads > or even claim 550MB/s: > http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000 > (funny bit: Sandforce SSD controllers dedup internally). > > Anyway, 6Gbps is not a future tale, but something long available. > And not the fastest kids on the block: currently build filesystems > must deal storage providing many gigabytes per second. Think > of massive disk arrays or stuff like Oracle F5100, claiming > 12.8GB/sec read and ~10GB/s write (in one rack unit).Sequential figures look nice and impressive but we all know they are meaningless for most real world workloads. IOPS are where it''s at. And maybe you can get 100,000 IOPS out of an SSD. But that still means 100,000 SHA256 hashes/second. That''s 3.2MB/s of SHA256 hashes, or about 2% of what a modern x64 CPU will do, assuming it doesn''t have a suitable hardware crypto accelerator for that algorithm. So on a reasonably recent quad core CPU you would probably be able to comfortably handle about 200x that before it starts becoming an issue. If you''re that concerned about space requirements, doing LZO compression will still be much more expensive. And that''s only for writes - on reads we don''t need to do any hashing (although it''s useful to do for the disk error checking reasons explained earlier). Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, January 06, 2011 09:00:47 am you wrote:> Peter A wrote: > > I''m saying in a filesystem it doesn''t matter - if you bundle everything > > into a backup stream, it does. Think of tar. 512 byte allignment. I tar > > up a directory with 8TB total size. No big deal. Now I create a new, > > empty file in this dir with a name that just happens to be the first in > > the dir. This adds 512 bytes close to the beginning of the tar file the > > second time I run tar. Now the remainder of the is all offset by > > 512bytes and, if you do dedupe on fs- block sized chunks larger than the > > 512bytes, not a single byte will be de- duped. > > OK, I get what you mean now. And I don''t think this is something that > should be solved in the file system.<snip>> Whether than is a worthwhile thing to do for poorly designed backup > solutions, but I''m not convinced about the general use-case. It''d be > very expensive and complicated for seemingly very limited benefit.Glad I finally explained myself properly... Unfortunately I disagree with you on the rest. If you take that logic, then I could claim dedupe is nothing a file system should handle - after all, its the user''s poorly designed applications that store multiple copies of data. Why should the fs take care of that? The problem doesn''t just affect backups. It affects everything where you have large data files that are not forced to allign with filesystem blocks. In addition to the case I mentioned above this affects in pretty much the same effectiveness: * Database dumps * Video Editing * Files backing iSCSI volumes * VM Images (fs blocks inside the VM rarely align with fs blocks in the backing storage). Our VM environment is backed with a 7410 and we get only about 10% dedupe. Copying the same images to a DataDomain results in a 60% reduction in space used. Basically, every time I end up using a lot of storage space, its in a scenario where fs-block based dedupe is not very effective. I also have to argue the point that these usages are "poorly designed". Poorly designed can only apply to technologies that existed or were talked about at the time the design was made. Tar and such have been around for a long time, way before anyone even though of dedupe. In addition, until there is a commonly accepted/standard API to query the block size so apps can generate files appropriately laid out for the backing filesystem, what is the application supposed to do? If anything, I would actually argue the opposite, that fixed block dedupe is a poor design: * The problem is known at the time the design was made * No alternative can be offered as tar, netbackup, video editing, ... has been around for a long time and is unlikely to change in the near future * There is no standard API to query the allignment parameters (and even that would not be great since copying a file alligned for 8k to a 16k alligned filesystem, would potentially cause the same issue again) Also from the human perspective its hard to make end users understand your point of view. I promote the 7000 series of storage and I know how hard it is to explain the dedupe behavior there. They see that Datadomain does it, and does it well. So why can''t solution xyz do it just as good?> Typical. And no doubt they complain that ZFS isn''t doing what they want, > rather than netbackup not co-operating. The solution to one misdesign > isn''t an expensive bodge. The solution to this particular problem is to > make netbackup work on per-file rather than per stream basis.I''d agree if it was just limited to netbackup... I know variable block length is a significantly more difficult problem than block level. That''s why the ZFS team made the design choice they did. Variable length is also the reason why the DataDomain solution is a scale out rather than scalue up approach. However, CPUs get faster and faster - eventually they''ll be able to handle it. So the right solution (from my limited point of view, as I said, I''m not a filesystem design expert) would be to implement the data structures to handle variable length. Then in the first iteration, implement the dedupe algorithm to only search on filesystem blocks using existing checksums and such. Less CPU usage, quicker development, easier debugging. Once that is stable and proven, you can then without requiring the user to reformat, go ahead and implement variable length dedupe... Btw, thanks for your time, Gordan :) Peter. -- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A wrote:> On Thursday, January 06, 2011 09:00:47 am you wrote: >> Peter A wrote: >>> I''m saying in a filesystem it doesn''t matter - if you bundle everything >>> into a backup stream, it does. Think of tar. 512 byte allignment. I tar >>> up a directory with 8TB total size. No big deal. Now I create a new, >>> empty file in this dir with a name that just happens to be the first in >>> the dir. This adds 512 bytes close to the beginning of the tar file the >>> second time I run tar. Now the remainder of the is all offset by >>> 512bytes and, if you do dedupe on fs- block sized chunks larger than the >>> 512bytes, not a single byte will be de- duped. >> OK, I get what you mean now. And I don''t think this is something that >> should be solved in the file system. > <snip> >> Whether than is a worthwhile thing to do for poorly designed backup >> solutions, but I''m not convinced about the general use-case. It''d be >> very expensive and complicated for seemingly very limited benefit. > Glad I finally explained myself properly... Unfortunately I disagree with you > on the rest. If you take that logic, then I could claim dedupe is nothing a > file system should handle - after all, its the user''s poorly designed > applications that store multiple copies of data. Why should the fs take care > of that?There is merit in that point. Some applications do in fact do their own deduplication, as mentioned previously on this thread.> The problem doesn''t just affect backups. It affects everything where you have > large data files that are not forced to allign with filesystem blocks. In > addition to the case I mentioned above this affects in pretty much the same > effectiveness: > * Database dumps > * Video Editing > * Files backing iSCSI volumes > * VM Images (fs blocks inside the VM rarely align with fs blocks in the > backing storage). Our VM environment is backed with a 7410 and we get only > about 10% dedupe. Copying the same images to a DataDomain results in a 60% > reduction in space used.I''d be interested to hear about the relative write performance on the variable block size.> I also have to argue the point that these usages are "poorly designed". Poorly > designed can only apply to technologies that existed or were talked about at > the time the design was made. Tar and such have been around for a long time, > way before anyone even though of dedupe. In addition, until there is a > commonly accepted/standard API to query the block size so apps can generate > files appropriately laid out for the backing filesystem, what is the application > supposed to do?Indeed, this goes philosophically in the direction that storage vendors have been (unsuccessfully) been trying to shift the industry for decades - object based storage. But, sadly, it hasn''t happened (yet?).> If anything, I would actually argue the opposite, that fixed block dedupe is a > poor design: > * The problem is known at the time the design was made > * No alternative can be offered as tar, netbackup, video editing, ... has been > around for a long time and is unlikely to change in the near future > * There is no standard API to query the allignment parameters (and even that > would not be great since copying a file alligned for 8k to a 16k alligned > filesystem, would potentially cause the same issue again) > > Also from the human perspective its hard to make end users understand your > point of view. I promote the 7000 series of storage and I know how hard it is > to explain the dedupe behavior there. They see that Datadomain does it, and > does it well. So why can''t solution xyz do it just as good?I''d be interested to see the evidence of the "variable length" argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that you don''t really want to store a 32 byte hash key with every 512 bytes of data, so you could peel off 512 byte blocks off the front in a hope that a bigger block that follows will match. Thinking about it, this might actually not be too expensive to do. If the 4KB block doesn''t match, check 512 byte sub-blocks, and try peeling them, to make the next one line up. If it doesn''t, store the mismatch as a full 4KB block and resume. If you do find a match, save the peeled 512 byte blocks separately and dedupe the 4KB block. In fact, it''s rather like the loop peeling optimization on a compiler, that allows you to align the data to the boundary suitable for vectorizing.>> Typical. And no doubt they complain that ZFS isn''t doing what they want, >> rather than netbackup not co-operating. The solution to one misdesign >> isn''t an expensive bodge. The solution to this particular problem is to >> make netbackup work on per-file rather than per stream basis.>> I''d agree if it was just limited to netbackup... I know variable block length > is a significantly more difficult problem than block level. That''s why the ZFS > team made the design choice they did. Variable length is also the reason why > the DataDomain solution is a scale out rather than scalue up approach. > However, CPUs get faster and faster - eventually they''ll be able to handle it. > So the right solution (from my limited point of view, as I said, I''m not a > filesystem design expert) would be to implement the data structures to handle > variable length. Then in the first iteration, implement the dedupe algorithm to > only search on filesystem blocks using existing checksums and such. Less CPU > usage, quicker development, easier debugging. Once that is stable and proven, > you can then without requiring the user to reformat, go ahead and implement > variable length dedupe...Actually, see above - I believe I was wrong about how expensive "variable length" block size is likely to be. It''s more expensive, sure, but not orders of magnitude more expensive, and as discussed earlier, given the CPU isn''t really the key bottleneck here, I think it''d be quite workable.> Btw, thanks for your time, Gordan :)You''re welcome. :) Gordan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 06, 2011 at 02:41:28PM +0000, Gordan Bobic wrote:> Ondřej Bílka wrote: > > >>>Then again, for a lot of use-cases there are perhaps better ways to > >>>achieve the targed goal than deduping on FS level, e.g. snapshotting or > >>>something like fl-cow: > >>>http://www.xmailserver.org/flcow.html > >>> > >As VM are concerned fl-cow is poor replacement of deduping. > > Depends on your VM. If your VM uses monolithic images, then you''re > right. For a better solution, take a look at vserver''s hashify > feature for something that does this very well in it''s own context. > > >Upgrading packages? 1st vm upgrades and copies changed files. > >After while second upgrades and copies files too. More and more becomes duped again. > > So you want online dedupe, then. :) > > >If you host multiple distributions you need to translate > >that /usr/share/bin/foo in foonux is /us/bin/bar in barux > > The chances of the binaries being the same between distros are > between slim and none. In the context of VMs where you have access > to raw files, as I said, look at vserver''s hashify feature. It > doesn''t care about file names, it will COW hard-link all files with > identical content. This doesn''t even require an exhaustive check of > all the files'' contents - you can start with file sizes. Files that > have different sizes can''t have the same contents, so you can > discard most of the comparing before you even open the file, most of > the work gets done based on metadata alone. >Yes I wrote this as quick example. On second thought files shared between distros are typicaly write-only(like manpages)> > >And primary reason to dedupe is not to reduce space usage but to > >improve caching. Why should machine A read file if machine B read it five minutes ago. > > Couldn''t agree more. This is what I was trying to explain earlier. > Even if deduping did cause more fragmentation (and I don''t think > that is the case to any significant extent), the improved caching > efficiency would more than offset this. > > Gordan > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- Program load too heavy for processor to lift. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, January 06, 2011 10:07:03 am you wrote:> I''d be interested to see the evidence of the "variable length" argument. > I have a sneaky suspicion that it actually falls back to 512 byte > blocks, which are much more likely to align, when more sensibly sized > blocks fail. The downside is that you don''t really want to store a 32 > byte hash key with every 512 bytes of data, so you could peel off 512 > byte blocks off the front in a hope that a bigger block that follows > will match. > > Thinking about it, this might actually not be too expensive to do. If > the 4KB block doesn''t match, check 512 byte sub-blocks, and try peeling > them, to make the next one line up. If it doesn''t, store the mismatch as > a full 4KB block and resume. If you do find a match, save the peeled 512 > byte blocks separately and dedupe the 4KB block. > > In fact, it''s rather like the loop peeling optimization on a compiler, > that allows you to align the data to the boundary suitable for vectorizing.I''m not sure about this but to be honest I can not see any other way. Otherwise, how would you ever find a match? You can not store checksums of random sub-sections hoping that eventually stuff will match up... 512 byte is probably the best choice as its been a "known block size" since the dawn of unix.> Actually, see above - I believe I was wrong about how expensive > "variable length" block size is likely to be. It''s more expensive, sure, > but not orders of magnitude more expensive, and as discussed earlier, > given the CPU isn''t really the key bottleneck here, I think it''d be > quite workable.Hmm - from my work with DD systems, it seems to be the CPU that ends up being the limiting factor on dedupe performance... But again, it seems that you have much deeper insight into this topic and have already drawn up an algorithm in your mind :) Peter. -- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:> On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote: > > >I have been thinking a lot about de-duplication for a backup application > > >I am writing. I wrote a little script to figure out how much it would > > >save me. For my laptop home directory, about 100 GiB of data, it was a > > >couple of percent, depending a bit on the size of the chunks. With 4 KiB > > >chunks, I would save about two gigabytes. (That''s assuming no MD5 hash > > >collisions.) I don''t have VM images, but I do have a fair bit of saved > > >e-mail. So, for backups, I concluded it was worth it to provide an > > >option to do this. I have no opinion on whether it is worthwhile to do > > >in btrfs. > > > > Online deduplication is very useful for backups of big, > > multi-gigabyte files which change constantly. > > Some mail servers store files this way; some MUA store the files > > like this; databases are also common to pack everything in big files > > which tend to change here and there almost all the time. > > > > Multi-gigabyte files which only have few megabytes changed can''t be > > hardlinked; simple maths shows that even compressing multiple files > > which have few differences will lead to greater space usage than a > > few megabytes extra in each (because everything else is > > deduplicated). > > > > And I don''t even want to think about IO needed to offline dedup a > > multi-terabyte storage (1 TB disks and bigger are becoming standard > > nowadays) i.e. daily, especially when the storage is already heavily > > used in IO terms. > > > > > > Now, one popular tool which can deal with small changes in files is > > rsync. It can be used to copy files over the network - so that if > > you want to copy/update a multi-gigabyte file which only has a few > > changes, rsync would need to transfer just a few megabytes. > > > > On disk however, rsync creates a "temporary copy" of the original > > file, where it packs unchanged contents together with any changes > > made. For example, while it copies/updates a file, we will have: > > > > original_file.bin > > .temporary_random_name > > > > Later, original_file.bin would be removed, and > > .temporary_random_name would be renamed to original_file.bin. Here > > goes away any deduplication we had so far, we have to start the IO > > over again. > > Sounds like all you need is cp --reflink=always and rsync --inplace. > > Haven''t tested is that works well, though.It works very well, btrfs with snapshots, compression and rsync --inplace has better storage utilisation than lessfs at around 10-15 snapshots with around 600GB of test data in small files. -- Hubert Kario QBS - Quality Business Software 02-656 Warszawa, ul. Ksawerów 30/85 tel. +48 (22) 646-61-51, 646-74-24 www.qbs.com.pl -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Peter A''s message of 2011-01-05 22:58:36 -0500:> On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote: > > > I''d just make it always use the fs block size. No point in making it > > > variable. > > > > Agreed. What is the reason for variable block size? > > First post on this list - I mostly was just reading so far to learn more on fs > design but this is one topic I (unfortunately) have experience with... > > You wouldn''t believe the difference variable block size dedupe makes. For a > pure fileserver, its ok to dedupe on block level but for most other uses, > variable is king. One big example is backups. Netbackup and most others > produce one stream with all data even when backing up to disk. Imagine you > move a whole lot of data from one dir to another. Think a directory with huge > video files. As a filesystem it would be de-duped nicely. The backup stream > however may and may not have matching fs blocks. If the directory name before > and after has the same lengths and such - then yeah, dedupe works. Directory > name is a byte shorter? Everything in the stream will be offset by one byte - > and no dedupe will occur at all on the whole dataset. In real world just > compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block > based) to a DataDomain (variable lenght) in this usage scenario. Among our > customers we see something like 3 to 17x dedupe ration on the DD, 1.02 - 1.05 > in the 7000.What is the smallest granularity that the datadomain searches for in terms of dedup? Josef''s current setup isn''t restricted to a specific block size, but there is a min match of 4k. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:> What is the smallest granularity that the datadomain searches for in > terms of dedup? > > Josef''s current setup isn''t restricted to a specific block size, but > there is a min match of 4k.I talked to a few people I know and didn''t get a clear answer either... However, 512 bytes came up more than once. I''m not really worried about the size of region to be used, but about offsetting it... its so easy to create large tars, ... where the content is offset by a few bytes, mutliples of 512 and such. Peter. -- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be a quite reasonable one. I did not see it in the code, but it is great to be able to collect statistics on how effective your hash is and any counters for the extra IO imposed. Also very useful to have a paranoid mode where when you see a hash collision (dedup candidate), you fall back to a byte-by-byte compare to verify that the the collision is correct. Keeping stats on how often this is a false collision would be quite interesting as well :) Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:> > I think that dedup has a variety of use cases that are all very dependent > on your workload. The approach you have here seems to be a quite > reasonable one. > > I did not see it in the code, but it is great to be able to collect > statistics on how effective your hash is and any counters for the extra > IO imposed. >So I have counters for how many extents are deduped and the overall file savings, is that what you are talking about?> Also very useful to have a paranoid mode where when you see a hash > collision (dedup candidate), you fall back to a byte-by-byte compare to > verify that the the collision is correct. Keeping stats on how often > this is a false collision would be quite interesting as well :) >So I''ve always done a byte-by-byte compare, first in userspace but now its in kernel, because frankly I don''t trust hashing algorithms with my data. It would be simple enough to keep statistics on how often the byte-by-byte compare comes out wrong, but really this is to catch changes to the file, so I have a suspicion that most of these statistics would be simply that the file changed, not that the hash was a collision. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Josef Bacik''s message of 2011-01-10 10:37:31 -0500:> On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote: > > > > I think that dedup has a variety of use cases that are all very dependent > > on your workload. The approach you have here seems to be a quite > > reasonable one. > > > > I did not see it in the code, but it is great to be able to collect > > statistics on how effective your hash is and any counters for the extra > > IO imposed. > > > > So I have counters for how many extents are deduped and the overall file > savings, is that what you are talking about? > > > Also very useful to have a paranoid mode where when you see a hash > > collision (dedup candidate), you fall back to a byte-by-byte compare to > > verify that the the collision is correct. Keeping stats on how often > > this is a false collision would be quite interesting as well :) > > > > So I''ve always done a byte-by-byte compare, first in userspace but now its in > kernel, because frankly I don''t trust hashing algorithms with my data. It would > be simple enough to keep statistics on how often the byte-by-byte compare comes > out wrong, but really this is to catch changes to the file, so I have a > suspicion that most of these statistics would be simply that the file changed, > not that the hash was a collision. Thanks,At least in the kernel, if you''re comparing extents on disk that are from a committed transaction. The contents won''t change. We could read into a private buffer instead of into the file''s address space to make this more reliable/strict. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:> Excerpts from Josef Bacik''s message of 2011-01-10 10:37:31 -0500: > > On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote: > > > > > > I think that dedup has a variety of use cases that are all very dependent > > > on your workload. The approach you have here seems to be a quite > > > reasonable one. > > > > > > I did not see it in the code, but it is great to be able to collect > > > statistics on how effective your hash is and any counters for the extra > > > IO imposed. > > > > > > > So I have counters for how many extents are deduped and the overall file > > savings, is that what you are talking about? > > > > > Also very useful to have a paranoid mode where when you see a hash > > > collision (dedup candidate), you fall back to a byte-by-byte compare to > > > verify that the the collision is correct. Keeping stats on how often > > > this is a false collision would be quite interesting as well :) > > > > > > > So I''ve always done a byte-by-byte compare, first in userspace but now its in > > kernel, because frankly I don''t trust hashing algorithms with my data. It would > > be simple enough to keep statistics on how often the byte-by-byte compare comes > > out wrong, but really this is to catch changes to the file, so I have a > > suspicion that most of these statistics would be simply that the file changed, > > not that the hash was a collision. Thanks, > > At least in the kernel, if you''re comparing extents on disk that are > from a committed transaction. The contents won''t change. We could read > into a private buffer instead of into the file''s address space to make > this more reliable/strict. >Right sorry I was talking in the userspace case. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, I like your idea and implementation for offline deduplication a lot. I think it will save me 50% of my backup storage! Your code walks/scans the directory/file tree of the filesystem. Would it be possible to walk/scan the disk extents sequentially in disk order? - This would be more I/O-efficient - This would save you reading previously deduped/snapshotted/hardlinked files more than once. - Maybe this would make it possible to deduplicate directories as well. Met vriendelijke groet, Arjen Nienhuis P.S. The NTFS implementation on Windows has ''ioctls'' to read the MFT sequentially in disk order and it''s *fast*. It''s being used for things like defrag. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html