Hi,

Say I download a large file from the net to /mnt/a.iso. I then download
the same file again to /mnt/b.iso. These files now have the same
content, but are stored twice, since the copies weren't made with the
bcp utility.

The same occurs if a directory tree with duplicate files (created with
bcp) is put through a non-aware program - for example tarred and then
untarred again.

This could be improved in two ways:

1) Make a utility which checks the checksums for all the data extents;
if the checksums match for two files, then check the file data, and if
the file data matches, keep only one copy. It could be run as a cron job
to free up disk space on systems where duplicate data is common (e.g.
virtual machine images).

2) Keep a tree of checksums for data blocks, so that a piece of data can
be located by its checksum. Whenever a data block is about to be
written, check whether the block matches any known block, and if it
does, don't bother duplicating the data on disk. I suspect this option
may not be realistic for performance reasons.

If either is possible, then thought needs to be put into whether it's
worth doing at the file level or at a partial-file level (i.e. if I have
two similar files, can the space used by the identical parts of the
files be saved?).

Has any thought been put into either 1) or 2) - are either possible or
desired?

Thanks
Oliver

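Option 1) can be prototyped at whole-file granularity with standard
tools. The pipeline below is a minimal, untested sketch (the /mnt path
is just a placeholder): it only reports groups of files with matching
checksums; actually collapsing them into one copy needs filesystem
support, which the rest of the thread discusses.

    # Minimal sketch of option 1 at whole-file granularity (hypothetical path):
    # hash every regular file, then print groups of files sharing a checksum.
    find /mnt -xdev -type f -exec md5sum {} + \
        | sort \
        | uniq -w32 --all-repeated=separate
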
Hi all,

On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos
<oliver.mattos08@imperial.ac.uk> wrote:
> 1) Make a utility which checks the checksums for all the data extents;
> if the checksums match for two files, then check the file data, and if
> the file data matches, keep only one copy. It could be run as a cron
> job to free up disk space on systems where duplicate data is common
> (e.g. virtual machine images).

I have a perl script that does this, made by a friend of mine.
First it sweeps a directory and stat()s every file, putting all files
with the same size X into a linked list on a hashtable entry for size X.
Then it md5sums all files with the same byte size to confirm whether
they really are copies of each other.
Because it first only stats, and only md5sums files with "potential"
duplicates, it's faster than the regular scripts I've seen.

Do you want this?

--
Miguel Sousa Filipe

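For illustration only, the same two-pass idea (group by size first,
then hash only candidate duplicates) can be sketched in shell. This is
not the perl script referred to above, just a rough, untested outline,
and it assumes GNU find, sort and uniq:

    #!/bin/bash
    # Rough sketch of the size-first approach: pass 1 records "size path" for
    # every regular file; pass 2 md5sums only files whose size occurs more
    # than once, then prints groups of identical checksums.
    DIR=${1:-.}
    TMP=$(mktemp)

    find "$DIR" -type f -printf '%s %p\n' | sort -n > "$TMP"

    # First awk read counts files per size; second read keeps shared sizes only.
    awk 'NR==FNR { count[$1]++; next } count[$1] > 1' "$TMP" "$TMP" |
    while read -r size path; do
        md5sum "$path"
    done | sort | uniq -w32 --all-repeated=separate

    rm -f "$TMP"
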
On Tue, 2008-12-09 at 22:48 +0000, Oliver Mattos wrote:
> 1) Make a utility which checks the checksums for all the data extents;
> if the checksums match for two files, then check the file data, and if
> the file data matches, keep only one copy. It could be run as a cron
> job to free up disk space on systems where duplicate data is common
> (e.g. virtual machine images).

Sage did extend the ioctl used by bcp to be able to deal with ranges of
file bytes. So, it could be used to do the actual COW step for this
utility.

> 2) Keep a tree of checksums for data blocks, so that a piece of data
> can be located by its checksum. Whenever a data block is about to be
> written, check whether the block matches any known block, and if it
> does, don't bother duplicating the data on disk. I suspect this option
> may not be realistic for performance reasons.

When compression was added, the writeback path was changed to make
option #2 viable, at least in the case where the admin is willing to
risk hash collisions on strong hashes. When a direct read comparison is
required before sharing blocks, it is probably best done by a
stand-alone utility, since we don't want to wait for a read of a full
extent every time we want to write one.

> If either is possible, then thought needs to be put into whether it's
> worth doing at the file level or at a partial-file level (i.e. if I
> have two similar files, can the space used by the identical parts of
> the files be saved?).

From the kernel, the easiest granularity is the block level.

> Has any thought been put into either 1) or 2) - are either possible or
> desired?

These are definitely on the long-term feature list. I don't think we'll
get them before 1.0, but it's an important feature.

-chris

De-duplication is useful in data backup systems because of the high
level of data redundancy, but I'm not sure whether it is necessary for a
general-purpose fs. If you really want to do so, I would suggest the
latter option: file-level de-dup can be implemented in a user-level
application.

Searching the web, I found a previous discussion of this topic in the
mail archive. I hope this helps:

http://markmail.org/message/sdxos4s2bnckven5#query:+page:1+mid:sdxos4s2bnckven5+state:results

> > 2) Keep a tree of checksums for data blocks, so that a piece of data
> > can be located by its checksum. Whenever a data block is about to be
> > written, check whether the block matches any known block, and if it
> > does, don't bother duplicating the data on disk. I suspect this
> > option may not be realistic for performance reasons.
>
> When compression was added, the writeback path was changed to make
> option #2 viable, at least in the case where the admin is willing to
> risk hash collisions on strong hashes. When a direct read comparison
> is required before sharing blocks, it is probably best done by a
> stand-alone utility, since we don't want to wait for a read of a full
> extent every time we want to write one.

Can we assume hash collisions won't occur? I mean, if it's a 256-bit
hash then even with 256TB of data, and one hash per block, the chances
of collision are still too small to calculate on gnome calculator.

The only issue is if the hash algorithm is later found to be flawed: a
malicious bit of data could be stored on the disk whose hash collides
with some more important data, potentially allowing the contents of one
file to be replaced with another.

Even if we don't assume hash collisions won't occur (e.g. for CRCs), the
write performance when writing duplicate files is equal to the read
performance of the disk, since for every block written by a program, one
block will need to be read and no blocks written. This is still better
than the plain write case (as most devices read faster than they write),
and has the added advantage of saving lots of space.

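For scale, a rough birthday-bound estimate of an accidental collision,
assuming 512-byte blocks (the block size used in the scripts later in
this thread) and a uniformly distributed 256-bit hash over 256TB of
data:

    n = 256TB / 512B = 2^39 blocks
    P(collision) ~= n^2 / 2^(256+1) = 2^78 / 2^257 = 2^-179 ~= 10^-54

This only covers accidental collisions; the deliberate-collision concern
above depends on the strength of the hash algorithm, not on the amount
of data stored.
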
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote:
> > When a direct read comparison is required before sharing blocks, it
> > is probably best done by a stand-alone utility, since we don't want
> > to wait for a read of a full extent every time we want to write one.
>
> Can a stand-alone utility do this without a race? E.g., if a portion
> of one of the files has already been read by the util, but is changed
> before the util has a chance to actually do the ioctl to duplicate the
> range.
>
> If we assume someone is keeping a hash table of "likely" matches,
> might it make sense to have a verify-and-dup ioctl that does this
> safely?

This sounds good, because then a standard user could safely do this to
any file as long as he had read access to both files, but wouldn't need
write access (since the operation doesn't actually modify the file
data).

I lost the original post, so I'm jumping in at the wrong thread-point :)

Someone mentioned that the primary usage of de-dup is in the backup
realm. True perhaps currently, but de-dup IMO is *the* killer app in the
world of virtualization, and is a huge reason why we're picking NetApp
at work to back our NFS VMware DataStores. We easily see 50% savings in
space.

I know of only one other production filesystem implementation of data
de-dup -- GreenBytes has it in their ZFS-based storage product.

I'm not sure why this hasn't caught on, but as soon as a solid and fast
implementation of it exists in the Linux world I really think it can
catch on for VM datastores... I know we've hollered at Sun as to why
they haven't rolled it out for ZFS yet!

Anyways, I know it's on the roadmap; just throwing in my $0.02 once in a
while on how big a feature I think this could be...

Great job all :)

Ray

I see quite a few uses for this, and while it looks like the kernel-mode
automatic de-dup-on-write code might be costly in performance, require
disk format changes, and be controversial, it sounds like the user-mode
utility could be implemented today.

It looks like a simple script could do the job - just iterate through
every file in the filesystem, run md5sum on every block of every file,
and whenever a duplicate is found call an ioctl to remove the duplicate
data. By md5summing each block it can also effectively compress disk
images. While not very efficient, it should work, and having something
like this in the toolkit would mean that as soon as btrfs gets stable
enough for everyday use it would immediately out-do every other Linux
filesystem in terms of space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good. I'm
willing to bet even the average user has, say, 1-2% of data duplicated
somewhere on the HD due to accidental copies instead of moves, the same
application installed to two different paths, two users who happen to
have the same file each saved in their home folder, etc., so even the
average user will benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see
how much duplicate wasted data I really have.

Thanks
Oliver

On Wed, Dec 10, 2008 at 09:42:16PM +0000, Oliver Mattos spake thusly:
> I'm considering writing that script to test on my ext3 disk just to
> see how much duplicate wasted data I really have.

Check out the fdupes command. In Fedora 8 it is in the yum repo as
fdupes-1.40-10.fc8

--
Tracy Reed
http://cuttedge.com

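For reference, typical fdupes invocations look something like the
following (option names as in fdupes 1.x - check fdupes --help on your
system; the /home/user path is just an example):

    # List duplicate files recursively, showing the size of each duplicate set:
    fdupes -r -S /home/user

    # Or just summarize how much space whole-file duplicates are wasting:
    fdupes -r -m /home/user
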
On Wed, 2008-12-10 at 13:57 -0800, Tracy Reed wrote:
> Check out the fdupes command. In Fedora 8 it is in the yum repo as
> fdupes-1.40-10.fc8

fdupes is good for whole-file matches, but btrfs supports partial-file
duplication elimination (e.g. where two big files share some parts in
common but other parts have changed). fdupes would be a good start
though.

On Wed, Dec 10, 2008 at 01:57:54PM -0800, Tracy Reed wrote:
> Check out the fdupes command. In Fedora 8 it is in the yum repo as
> fdupes-1.40-10.fc8

Neat tool. I guess this just checks for duplicate files, though. It
would be interesting to see how many duplicate *blocks* there are across
the filesystem, agnostic to files...

Is this something your script does, Oliver?

Ray

> It would be interesting to see how many duplicate *blocks* there are
> across the filesystem, agnostic to files...
>
> Is this something your script does, Oliver?

My script doesn't yet exist, although when created it would, yes.

I was thinking of just making a bash script and using dd to extract
512-byte chunks of files, pipe them through md5sum, and save the result
in a large index file. Next, just iterate through the index file looking
for duplicate hashes.

In fact that sounds so easy I might do it right now... (only to
proof-of-concept stage - a real utility would probably want to be
written in a compiled language and use proper trees for faster
searching)

Here is a script to locate duplicate data WITHIN files:

On some test file sets of binary data with no duplicated files, about 3%
of the data blocks were duplicated, and about 0.1% of the data blocks
were nulls. The data was mainly ELF and Win32 binaries plus some random
game data, office documents and a few images.

This code is hideously slow, so don't give it more than a couple of MB
of files to chew through at once. In retrospect I should've just written
it in plain fast C instead of fighting with bash pipes!

Note: to get "verbose" output, just remove everything after the word
"sort" in the code.

___________________ V CODE V ____________________________

#!/bin/bash
# **********************************************************#
# Redundant data detector                                    #
#                                                            #
# Simple script to take an MD5 hash of every block in every  #
# file in a folder and detect identical blocks               #
#                                                            #
# Copyright 2008 Oliver Mattos, Released under the GPL.      #
# **********************************************************#

# WARNING - This script is very inefficient, so don't run it
# with more than 50,000 blocks at once.

: ${1?"Usage: $0 PATH [BlockSize]"}

BS=${2-512}   # Block size in bytes, can be specified on command line
NULLCOUNT="0"
DUPCOUNT="0"
TOTCOUNT="0"

# Hash of an all-zero block, used to count null blocks separately.
NULLHASH=`dd if=/dev/zero bs=$BS count=1 2>/dev/null | md5sum -b`
NULLHASH="${NULLHASH:0:32}"

# Hash every block of every regular (mode 0644) file under the given path,
# emitting one "hash blocknumber filename" line per block.
find "$1" | \
while read i; do
    if [ "` stat "$i" -c%f `" == "81a4" ]; then
        LEN=` stat "$i" -c%s `
        BC=0
        while [ $LEN -gt $[$BC * $BS] ]; do
            echo `dd if="$i" bs=$BS count=1 skip=$BC 2>/dev/null | md5sum -b` $BC $i
            BC=$[ BC + 1 ]
        done
    fi
done | sort | \
while read j; do
    OLDHASH=$HASH
    HASH=${j:0:32}
    TOTCOUNT=$[ $TOTCOUNT + 1 ]
    if [ "$HASH" == "$OLDHASH" ]; then
        DUPCOUNT=$[ DUPCOUNT + 1 ]
        if [ "$HASH" == "$NULLHASH" ]; then
            NULLCOUNT=$[ NULLCOUNT + 1 ]
        fi
    fi
    echo Hashed $TOTCOUNT $BS byte blocks, found $DUPCOUNT redundant \
        blocks of data, of which $NULLCOUNT blocks were null.
done | tail -n 1

# These last two lines are a bodge because the variables don't seem to
# come out of the while properly, probably something to do with the
# pipes...

On Thu, Dec 11, 2008 at 03:42:58AM +0000, Oliver Mattos wrote:
> Here is a script to locate duplicate data WITHIN files:
>
> On some test file sets of binary data with no duplicated files, about
> 3% of the data blocks were duplicated, and about 0.1% of the data
> blocks were nulls. The data was mainly ELF and Win32 binaries plus
> some random game data, office documents and a few images.
>
> This code is hideously slow, so don't give it more than a couple of MB
> of files to chew through at once. In retrospect I should've just
> written it in plain fast C instead of fighting with bash pipes!
>
> Note: to get "verbose" output, just remove everything after the word
> "sort" in the code.

Neat. Thanks much. It'd be cool to output the results of each of your
hashes to a database so you can get a feel for how many duplicate blocks
there are cross-files as well.

I'd like to run this in a similar setup on all my VMware VMDK files and
get an idea of how much space savings there would be across 20+ Windows
2003 VMDK files... probably *lots* of common blocks.

Ray

> Neat. Thanks much. It'd be cool to output the results of each of your
> hashes to a database so you can get a feel for how many duplicate
> blocks there are cross-files as well.
>
> I'd like to run this in a similar setup on all my VMware VMDK files
> and get an idea of how much space savings there would be across 20+
> Windows 2003 VMDK files... probably *lots* of common blocks.
>
> Ray

Hi,

Currently it DOES do cross-file block matching - that's why it takes
sooo long to run :-)

If you remove everything after the word "sort", it will produce verbose
output, which you could then stick into some SQL database if you wanted.
You could relatively easily format the output for direct input to an SQL
database if you modified the line with the "dd" in it within the first
while loop. You can also remove the "sort" and the pipe before it to get
unsorted output - the advantage of this is that it takes less time.

I'm guessing that if you had the time to run this on multi-gigabyte disk
images, you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines of the same operating system. That
means if you have 20 Win 2k3 VMs and the first VM image of Windows +
software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.

Thanks
Oliver

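As a concrete, untested illustration of the modification described
above: the per-block hashing loop can emit CSV and the result can be
loaded into SQLite for cross-file queries. The file names, table layout
and the dropped mode check below are illustrative choices, not part of
the original script.

    #!/bin/bash
    # Hypothetical variation on the earlier script: write "hash,blocknum,file"
    # as CSV, load it into SQLite, and compare total blocks against distinct
    # block contents to estimate de-duplication savings.
    BS=512
    find "$1" -type f | while read -r f; do
        LEN=$(stat -c%s "$f")
        BC=0
        while [ $((BC * BS)) -lt "$LEN" ]; do
            HASH=$(dd if="$f" bs=$BS count=1 skip=$BC 2>/dev/null | md5sum | cut -c1-32)
            echo "$HASH,$BC,\"$f\""
            BC=$((BC + 1))
        done
    done > blocks.csv

    sqlite3 blocks.db <<'EOF'
    CREATE TABLE blocks (hash TEXT, blocknum INTEGER, file TEXT);
    .mode csv
    .import blocks.csv blocks
    -- total blocks vs. distinct block contents
    SELECT COUNT(*) AS total, COUNT(DISTINCT hash) AS distinct_blocks FROM blocks;
    EOF
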
On Wed, 2008-12-10 at 17:53 +0000, Oliver Mattos wrote:
> Can we assume hash collisions won't occur? I mean, if it's a 256-bit
> hash then even with 256TB of data, and one hash per block, the chances
> of collision are still too small to calculate on gnome calculator.

It depends on the use case. We can assume that if someone really wants
collisions to occur, they will eventually be able to trigger them. So,
the code should have the option to verify the bytes are identical via a
read.

For a backup farm or virtualization dataset, I personally wouldn't use
the read-verify stage. But others will.

-chris

On Thu, 11 Dec 2008 8:19:03 am Ray Van Dolson wrote:
> I'm not sure why this hasn't caught on, but as soon as a solid and
> fast implementation of it exists in the Linux world I really think it
> can catch on for VM datastores... I know we've hollered at Sun as to
> why they haven't rolled it out for ZFS yet!

Might NetApp have patented the technique? Would be a similar situation
to the issues [1] that the KSM work [2] fell into with trying to prevent
duplication of pages.

cheers,
Chris

[1] - http://lwn.net/Articles/309155/
[2] - http://lwn.net/Articles/306704/

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP

Quoting Oliver Mattos <oliver.mattos08@imperial.ac.uk> on Thu, Dec 11 00:18:
> > It would be interesting to see how many duplicate *blocks* there are
> > across the filesystem, agnostic to files...

Here is my contribution. It's a perl script that goes through every
block (of various block sizes) on a device and prints out summaries. It
uses BerkeleyDB to store the hashes as it goes, as there are an enormous
number for smaller block sizes.

It checks every block, not just the used blocks, so on a filesystem less
than 100% full it really isn't a proper test. It requires root
privileges in order to read the raw block device.

My root filesystem on my Debian box, 20G, 9.2G used:

root:   512 byte blocks  27.07% duplicate.
root:  1024 byte blocks  26.20% duplicate.
root:  4096 byte blocks  23.17% duplicate.
root:  8192 byte blocks  15.31% duplicate.
root: 16384 byte blocks   9.57% duplicate.

For my home partition, 20G, 29% used:

home:   512 byte blocks  32.11% duplicate.
home:  1024 byte blocks  31.16% duplicate.
home:  4096 byte blocks  29.85% duplicate.
home:  8192 byte blocks  21.31% duplicate.
home: 16384 byte blocks  10.84% duplicate.

There is, of course, a fair amount of overhead for smaller blocks:

512b  3.91% overhead
1k    1.95% overhead
4k    0.49% overhead
8k    0.24% overhead
16k   0.12% overhead

Though, 4% overhead for 27-32% savings seems pretty impressive. Again,
it checksums all blocks, not just used blocks, so the whole test might
be bogus.

--
Life is wasted on the living.

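The quoted overhead figures are consistent with storing one 20-byte
(SHA-1-sized) digest per block - an inference from the numbers, not
something stated in the post:

    overhead ~= digest size / block size, for a 20-byte digest:
    20/512  = 3.91%    20/1024  = 1.95%    20/4096 = 0.49%
    20/8192 = 0.24%    20/16384 = 0.12%
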