I know rsync can do many things, but I was wondering if anyone is using it for data deduplication on a large filesystem. I have a filesystem which is about 2TB and I want to make sure I don't have the same data in different places on the filesystem. Is there an algorithm for that? TIA
At 06:41 25.05.2010 -0400, Mag Gam wrote:

>I know rsync can do many things, but I was wondering if anyone is using
>it for data deduplication on a large filesystem. I have a filesystem
>which is about 2TB and I want to make sure I don't have the same data
>in different places on the filesystem. Is there an algorithm for that?

I think for just finding identical files you would be better off with SHA, MD5, or other hashes. Even when hashes match, I'd check first before removing one of the files, or at least do a full byte-for-byte compare. I don't think rsync is up to the task unless you want to, e.g., merge two slightly different trees and then delete the remainder. rsync could help in that case, but only if the file names match. If you want to find identical files under different names, I don't think you can do it with rsync.

bye Fabi
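To make the hash-based idea above concrete, here is a minimal Python sketch of that kind of check. It is not part of the original suggestion; the SHA-256 choice, chunk size, and script structure are illustrative assumptions. It buckets files by size first (only equal-sized files can be identical), hashes the candidates, prints groups with matching digests, and deliberately deletes nothing, so a full compare can still be done before removing anything.

#!/usr/bin/env python3
# Sketch of hash-based duplicate detection: group files by size,
# hash only the candidates, and report groups whose digests match.
# Nothing is deleted; verify with a byte-for-byte compare before removing.

import hashlib
import os
import sys
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # First pass: bucket files by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable or vanished file; skip it

    # Second pass: hash only files that share a size with another file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_hash[file_hash(path)].append(path)
            except OSError:
                pass

    return [paths for paths in by_hash.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1]):
        print("possible duplicates:")
        for path in group:
            print("  " + path)

Invoke it with the directory to scan as the only argument; on a tree the size of the one described here, the size bucketing avoids hashing most files.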
On 5/25/2010 6:41 AM, Mag Gam wrote:
> I know rsync can do many things, but I was wondering if anyone is using
> it for data deduplication on a large filesystem. I have a filesystem
> which is about 2TB and I want to make sure I don't have the same data
> in different places on the filesystem. Is there an algorithm for that?

While rsync is not an appropriate tool for this, I have successfully used dupseek in the past:

http://freshmeat.net/projects/dupseek/

It is a perl script, so I expect you should be able to use it on any platform you need. It shows support for POSIX/Linux, but I expect it can run under Windows as well if you are comfortable with Cygwin. I'm sure there are many more tools like this; I used this one because it was optimized for large files.

-Ben