I know rsync can do many things, but I was wondering if anyone is using it for data deduplication on a large filesystem. I have a filesystem which is about 2TB and I want to make sure I don't have the same data in different places on the filesystem. Is there an algorithm for that? TIA
At 06:41 25.05.2010 -0400, Mag Gam wrote:

>I know rsync can do many things, but I was wondering if anyone is using
>it for data deduplication on a large filesystem. I have a filesystem
>which is about 2TB and I want to make sure I don't have the same data
>in different places on the filesystem. Is there an algorithm for that?

I think for just finding identical files you would be better off with SHA, MD5, or other hashes. Even when hashes match, I'd check first before removing one of the files, or at least do a full byte-for-byte compare. I don't think rsync is up to the task unless you want to, e.g., merge two slightly different trees and then delete the remainder. rsync could help in that case, but only if the file names match. If you want to find identical files under different names, I don't think you can do it with rsync.

bye Fabi
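To make the hash-based idea above concrete, here is a minimal Python sketch of that kind of check. It is not part of the original suggestion; the SHA-256 choice, chunk size, and script structure are illustrative assumptions. It buckets files by size first (only equal-sized files can be identical), hashes the candidates, prints groups with matching digests, and deliberately deletes nothing, so a full compare can still be done before removing anything.

#!/usr/bin/env python3
# Sketch of hash-based duplicate detection: group files by size,
# hash only the candidates, and report groups whose digests match.
# Nothing is deleted; verify with a byte-for-byte compare before removing.

import hashlib
import os
import sys
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # First pass: bucket files by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable or vanished file; skip it

    # Second pass: hash only files that share a size with another file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_hash[file_hash(path)].append(path)
            except OSError:
                pass

    return [paths for paths in by_hash.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1]):
        print("possible duplicates:")
        for path in group:
            print("  " + path)

Invoke it with the directory to scan as the only argument; on a tree the size of the one described here, the size bucketing avoids hashing most files.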
On 5/25/2010 6:41 AM, Mag Gam wrote:
> I know rsync can do many things, but I was wondering if anyone is using
> it for data deduplication on a large filesystem. I have a filesystem
> which is about 2TB and I want to make sure I don't have the same data
> in different places on the filesystem. Is there an algorithm for that?

While rsync is not an appropriate tool for this, I have successfully used dupseek in the past:

http://freshmeat.net/projects/dupseek/

It is a perl script, so I expect you should be able to use it on any platform you need. It shows support for POSIX/Linux, but I expect it can run under Windows as well if you are comfortable with Cygwin. I'm sure there are many more tools like this; I used this one because it was optimized for large files.

-Ben