Hi,

Say I download a large file from the net to /mnt/a.iso. I then download
the same file again to /mnt/b.iso. These files now have the same
content, but are stored twice, since the copies weren't made with the
bcp utility.

The same occurs if a directory tree with duplicate files (created with
bcp) is put through a non-aware program - for example tarred and then
untarred again.

This could be improved in two ways:

1) Make a utility which checks the checksums for all the data extents;
if the checksums match for two files, then check the file data, and if
the file data matches, keep only one copy. It could be run as a cron job
to free up disk space on systems where duplicate data is common (e.g.
virtual machine images).

2) Keep a tree of checksums for data blocks, so that a piece of data can
be located by its checksum. Whenever a data block is about to be
written, check whether the block matches any known block, and if it
does, don't bother duplicating the data on disk. I suspect this option
may not be realistic for performance reasons.

If either is possible, then thought needs to be put into whether it's
worth doing at the file level or at a partial-file level (i.e. if I have
two similar files, can the space used by the identical parts of the
files be saved?).

Has any thought been put into either 1) or 2) - are either possible or
desired?

Thanks
Oliver

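Option 1) can be prototyped at whole-file granularity with standard
tools. The pipeline below is a minimal, untested sketch (the /mnt path
is just a placeholder): it only reports groups of files with matching
checksums; actually collapsing them into one copy needs filesystem
support, which the rest of the thread discusses.

    # Minimal sketch of option 1 at whole-file granularity (hypothetical path):
    # hash every regular file, then print groups of files sharing a checksum.
    find /mnt -xdev -type f -exec md5sum {} + \
        | sort \
        | uniq -w32 --all-repeated=separate
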
Hi all,

On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos
<oliver.mattos08@imperial.ac.uk> wrote:
> 1) Make a utility which checks the checksums for all the data extents;
> if the checksums match for two files, then check the file data, and if
> the file data matches, keep only one copy. It could be run as a cron
> job to free up disk space on systems where duplicate data is common
> (e.g. virtual machine images).

I have a perl script that does this, made by a friend of mine.
First it sweeps a directory and stat()s every file, putting all files
with the same size X into a linked list on a hashtable entry for size X.
Then it md5sums all files with the same byte size to confirm whether
they really are copies of each other.
Because it first only stats, and only md5sums files with "potential"
duplicates, it's faster than the regular scripts I've seen.

Do you want this?

--
Miguel Sousa Filipe

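For illustration only, the same two-pass idea (group by size first,
then hash only candidate duplicates) can be sketched in shell. This is
not the perl script referred to above, just a rough, untested outline,
and it assumes GNU find, sort and uniq:

    #!/bin/bash
    # Rough sketch of the size-first approach: pass 1 records "size path" for
    # every regular file; pass 2 md5sums only files whose size occurs more
    # than once, then prints groups of identical checksums.
    DIR=${1:-.}
    TMP=$(mktemp)

    find "$DIR" -type f -printf '%s %p\n' | sort -n > "$TMP"

    # First awk read counts files per size; second read keeps shared sizes only.
    awk 'NR==FNR { count[$1]++; next } count[$1] > 1' "$TMP" "$TMP" |
    while read -r size path; do
        md5sum "$path"
    done | sort | uniq -w32 --all-repeated=separate

    rm -f "$TMP"
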
On Tue, 2008-12-09 at 22:48 +0000, Oliver Mattos wrote:
> 1) Make a utility which checks the checksums for all the data extents;
> if the checksums match for two files, then check the file data, and if
> the file data matches, keep only one copy. It could be run as a cron
> job to free up disk space on systems where duplicate data is common
> (e.g. virtual machine images).

Sage did extend the ioctl used by bcp to be able to deal with ranges of
file bytes. So, it could be used to do the actual COW step for this
utility.

> 2) Keep a tree of checksums for data blocks, so that a piece of data
> can be located by its checksum. Whenever a data block is about to be
> written, check whether the block matches any known block, and if it
> does, don't bother duplicating the data on disk. I suspect this option
> may not be realistic for performance reasons.

When compression was added, the writeback path was changed to make
option #2 viable, at least in the case where the admin is willing to
risk hash collisions on strong hashes. When a direct read comparison is
required before sharing blocks, it is probably best done by a
stand-alone utility, since we don't want to wait for a read of a full
extent every time we want to write one.

> If either is possible, then thought needs to be put into whether it's
> worth doing at the file level or at a partial-file level (i.e. if I
> have two similar files, can the space used by the identical parts of
> the files be saved?).

From the kernel, the easiest granularity is the block level.

> Has any thought been put into either 1) or 2) - are either possible or
> desired?

These are definitely on the long-term feature list. I don't think we'll
get them before 1.0, but it's an important feature.

-chris

De-duplication is useful in data backup systems because of the high
level of data redundancy, but I'm not sure whether it is necessary for a
general-purpose fs. If you really want to do so, I would suggest the
latter option: file-level de-dup can be implemented in a user-level
application.

Searching the web, I found a previous discussion of this topic in the
mail archive. I hope this helps:

http://markmail.org/message/sdxos4s2bnckven5#query:+page:1+mid:sdxos4s2bnckven5+state:results

> > 2) Keep a tree of checksums for data blocks, so that a piece of data
> > can be located by its checksum. Whenever a data block is about to be
> > written, check whether the block matches any known block, and if it
> > does, don't bother duplicating the data on disk. I suspect this
> > option may not be realistic for performance reasons.
>
> When compression was added, the writeback path was changed to make
> option #2 viable, at least in the case where the admin is willing to
> risk hash collisions on strong hashes. When a direct read comparison
> is required before sharing blocks, it is probably best done by a
> stand-alone utility, since we don't want to wait for a read of a full
> extent every time we want to write one.

Can we assume hash collisions won't occur? I mean, if it's a 256-bit
hash then even with 256TB of data, and one hash per block, the chances
of collision are still too small to calculate on gnome calculator.

The only issue is if the hash algorithm is later found to be flawed: a
malicious bit of data could be stored on the disk whose hash collides
with some more important data, potentially allowing the contents of one
file to be replaced with another.

Even if we don't assume hash collisions won't occur (e.g. for CRCs), the
write performance when writing duplicate files is equal to the read
performance of the disk, since for every block written by a program, one
block will need to be read and no blocks written. This is still better
than the plain write case (as most devices read faster than they write),
and has the added advantage of saving lots of space.

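For scale, a rough birthday-bound estimate of an accidental collision,
assuming 512-byte blocks (the block size used in the scripts later in
this thread) and a uniformly distributed 256-bit hash over 256TB of
data:

    n = 256TB / 512B = 2^39 blocks
    P(collision) ~= n^2 / 2^(256+1) = 2^78 / 2^257 = 2^-179 ~= 10^-54

This only covers accidental collisions; the deliberate-collision concern
above depends on the strength of the hash algorithm, not on the amount
of data stored.
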
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote:
> > When a direct read comparison is required before sharing blocks, it
> > is probably best done by a stand-alone utility, since we don't want
> > to wait for a read of a full extent every time we want to write one.
>
> Can a stand-alone utility do this without a race? E.g., if a portion
> of one of the files has already been read by the util, but is changed
> before the util has a chance to actually do the ioctl to duplicate the
> range.
>
> If we assume someone is keeping a hash table of "likely" matches,
> might it make sense to have a verify-and-dup ioctl that does this
> safely?

This sounds good, because then a standard user could safely do this to
any file as long as he had read access to both files, but wouldn't need
write access (since the operation doesn't actually modify the file
data).

I lost the original post, so I'm jumping in at the wrong thread-point :)

Someone mentioned that the primary usage of de-dup is in the backup
realm. True perhaps currently, but de-dup IMO is *the* killer app in the
world of virtualization, and is a huge reason why we're picking NetApp
at work to back our NFS VMware DataStores. We easily see 50% savings in
space.

I know of only one other production filesystem implementation of data
de-dup -- GreenBytes has it in their ZFS-based storage product.

I'm not sure why this hasn't caught on, but as soon as a solid and fast
implementation of it exists in the Linux world I really think it can
catch on for VM datastores... I know we've hollered at Sun as to why
they haven't rolled it out for ZFS yet!

Anyways, I know it's on the roadmap; just throwing in my $0.02 once in a
while on how big a feature I think this could be...

Great job all :)

Ray

I see quite a few uses for this, and while it looks like the kernel-mode
automatic de-dup-on-write code might be costly in performance, require
disk format changes, and be controversial, it sounds like the user-mode
utility could be implemented today.

It looks like a simple script could do the job - just iterate through
every file in the filesystem, run md5sum on every block of every file,
and whenever a duplicate is found call an ioctl to remove the duplicate
data. By md5summing each block it can also effectively compress disk
images. While not very efficient, it should work, and having something
like this in the toolkit would mean that as soon as btrfs gets stable
enough for everyday use it would immediately out-do every other Linux
filesystem in terms of space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good. I'm
willing to bet even the average user has, say, 1-2% of data duplicated
somewhere on the HD due to accidental copies instead of moves, the same
application installed to two different paths, two users who happen to
have the same file each saved in their home folder, etc., so even the
average user will benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see
how much duplicate wasted data I really have.

Thanks
Oliver

On Wed, Dec 10, 2008 at 09:42:16PM +0000, Oliver Mattos spake thusly:
> I'm considering writing that script to test on my ext3 disk just to
> see how much duplicate wasted data I really have.

Check out the fdupes command. In Fedora 8 it is in the yum repo as
fdupes-1.40-10.fc8

--
Tracy Reed
http://cuttedge.com

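For reference, typical fdupes invocations look something like the
following (option names as in fdupes 1.x - check fdupes --help on your
system; the /home/user path is just an example):

    # List duplicate files recursively, showing the size of each duplicate set:
    fdupes -r -S /home/user

    # Or just summarize how much space whole-file duplicates are wasting:
    fdupes -r -m /home/user
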
On Wed, 2008-12-10 at 13:57 -0800, Tracy Reed wrote:
> Check out the fdupes command. In Fedora 8 it is in the yum repo as
> fdupes-1.40-10.fc8

fdupes is good for whole-file matches, but btrfs supports partial-file
duplication elimination (e.g. where two big files share some parts in
common but other parts have changed). fdupes would be a good start
though.

On Wed, Dec 10, 2008 at 01:57:54PM -0800, Tracy Reed wrote:
> Check out the fdupes command. In Fedora 8 it is in the yum repo as
> fdupes-1.40-10.fc8

Neat tool. I guess this just checks for duplicate files, though. It
would be interesting to see how many duplicate *blocks* there are across
the filesystem, agnostic to files...

Is this something your script does, Oliver?

Ray

> It would be interesting to see how many duplicate *blocks* there are
> across the filesystem, agnostic to files...
>
> Is this something your script does, Oliver?

My script doesn't yet exist, although when created it would, yes.

I was thinking of just making a bash script and using dd to extract
512-byte chunks of files, pipe them through md5sum, and save the result
in a large index file. Next, just iterate through the index file looking
for duplicate hashes.

In fact that sounds so easy I might do it right now... (only to
proof-of-concept stage - a real utility would probably want to be
written in a compiled language and use proper trees for faster
searching)

Here is a script to locate duplicate data WITHIN files:

On some test file sets of binary data with no duplicated files, about 3%
of the data blocks were duplicated, and about 0.1% of the data blocks
were nulls. The data was mainly ELF and Win32 binaries plus some random
game data, office documents and a few images.

This code is hideously slow, so don't give it more than a couple of MB
of files to chew through at once. In retrospect I should've just written
it in plain fast C instead of fighting with bash pipes!

Note: to get "verbose" output, just remove everything after the word
"sort" in the code.

___________________ V CODE V ____________________________

#!/bin/bash
# **********************************************************#
# Redundant data detector                                    #
#                                                            #
# Simple script to take an MD5 hash of every block in every  #
# file in a folder and detect identical blocks               #
#                                                            #
# Copyright 2008 Oliver Mattos, Released under the GPL.      #
# **********************************************************#

# WARNING - This script is very inefficient, so don't run it
# with more than 50,000 blocks at once.

: ${1?"Usage: $0 PATH [BlockSize]"}

BS=${2-512}   # Block size in bytes, can be specified on command line
NULLCOUNT="0"
DUPCOUNT="0"
TOTCOUNT="0"

# Hash of an all-zero block, used to count null blocks separately.
NULLHASH=`dd if=/dev/zero bs=$BS count=1 2>/dev/null | md5sum -b`
NULLHASH="${NULLHASH:0:32}"

# Hash every block of every regular (mode 0644) file under the given path,
# emitting one "hash blocknumber filename" line per block.
find "$1" | \
while read i; do
    if [ "` stat "$i" -c%f `" == "81a4" ]; then
        LEN=` stat "$i" -c%s `
        BC=0
        while [ $LEN -gt $[$BC * $BS] ]; do
            echo `dd if="$i" bs=$BS count=1 skip=$BC 2>/dev/null | md5sum -b` $BC $i
            BC=$[ BC + 1 ]
        done
    fi
done | sort | \
while read j; do
    OLDHASH=$HASH
    HASH=${j:0:32}
    TOTCOUNT=$[ $TOTCOUNT + 1 ]
    if [ "$HASH" == "$OLDHASH" ]; then
        DUPCOUNT=$[ DUPCOUNT + 1 ]
        if [ "$HASH" == "$NULLHASH" ]; then
            NULLCOUNT=$[ NULLCOUNT + 1 ]
        fi
    fi
    echo Hashed $TOTCOUNT $BS byte blocks, found $DUPCOUNT redundant \
        blocks of data, of which $NULLCOUNT blocks were null.
done | tail -n 1

# These last two lines are a bodge because the variables don't seem to
# come out of the while properly, probably something to do with the
# pipes...

On Thu, Dec 11, 2008 at 03:42:58AM +0000, Oliver Mattos wrote:
> Here is a script to locate duplicate data WITHIN files:
>
> On some test file sets of binary data with no duplicated files, about
> 3% of the data blocks were duplicated, and about 0.1% of the data
> blocks were nulls. The data was mainly ELF and Win32 binaries plus
> some random game data, office documents and a few images.
>
> This code is hideously slow, so don't give it more than a couple of MB
> of files to chew through at once. In retrospect I should've just
> written it in plain fast C instead of fighting with bash pipes!
>
> Note: to get "verbose" output, just remove everything after the word
> "sort" in the code.

Neat. Thanks much. It'd be cool to output the results of each of your
hashes to a database so you can get a feel for how many duplicate blocks
there are cross-files as well.

I'd like to run this in a similar setup on all my VMware VMDK files and
get an idea of how much space savings there would be across 20+ Windows
2003 VMDK files... probably *lots* of common blocks.

Ray

> Neat. Thanks much. It'd be cool to output the results of each of your
> hashes to a database so you can get a feel for how many duplicate
> blocks there are cross-files as well.
>
> I'd like to run this in a similar setup on all my VMware VMDK files
> and get an idea of how much space savings there would be across 20+
> Windows 2003 VMDK files... probably *lots* of common blocks.
>
> Ray

Hi,

Currently it DOES do cross-file block matching - that's why it takes
sooo long to run :-)

If you remove everything after the word "sort", it will produce verbose
output, which you could then stick into some SQL database if you wanted.
You could relatively easily format the output for direct input to an SQL
database if you modified the line with the "dd" in it within the first
while loop. You can also remove the "sort" and the pipe before it to get
unsorted output - the advantage of this is that it takes less time.

I'm guessing that if you had the time to run this on multi-gigabyte disk
images, you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines of the same operating system. That
means if you have 20 Win 2k3 VMs and the first VM image of Windows +
software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.

Thanks
Oliver

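As a concrete, untested illustration of the modification described
above: the per-block hashing loop can emit CSV and the result can be
loaded into SQLite for cross-file queries. The file names, table layout
and the dropped mode check below are illustrative choices, not part of
the original script.

    #!/bin/bash
    # Hypothetical variation on the earlier script: write "hash,blocknum,file"
    # as CSV, load it into SQLite, and compare total blocks against distinct
    # block contents to estimate de-duplication savings.
    BS=512
    find "$1" -type f | while read -r f; do
        LEN=$(stat -c%s "$f")
        BC=0
        while [ $((BC * BS)) -lt "$LEN" ]; do
            HASH=$(dd if="$f" bs=$BS count=1 skip=$BC 2>/dev/null | md5sum | cut -c1-32)
            echo "$HASH,$BC,\"$f\""
            BC=$((BC + 1))
        done
    done > blocks.csv

    sqlite3 blocks.db <<'EOF'
    CREATE TABLE blocks (hash TEXT, blocknum INTEGER, file TEXT);
    .mode csv
    .import blocks.csv blocks
    -- total blocks vs. distinct block contents
    SELECT COUNT(*) AS total, COUNT(DISTINCT hash) AS distinct_blocks FROM blocks;
    EOF
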
On Wed, 2008-12-10 at 17:53 +0000, Oliver Mattos wrote:
> Can we assume hash collisions won't occur? I mean, if it's a 256-bit
> hash then even with 256TB of data, and one hash per block, the chances
> of collision are still too small to calculate on gnome calculator.

It depends on the use case. We can assume that if someone really wants
collisions to occur, they will eventually be able to trigger them. So,
the code should have the option to verify the bytes are identical via a
read.

For a backup farm or virtualization dataset, I personally wouldn't use
the read-verify stage. But others will.

-chris

On Thu, 11 Dec 2008 8:19:03 am Ray Van Dolson wrote:
> I'm not sure why this hasn't caught on, but as soon as a solid and
> fast implementation of it exists in the Linux world I really think it
> can catch on for VM datastores... I know we've hollered at Sun as to
> why they haven't rolled it out for ZFS yet!

Might NetApp have patented the technique? Would be a similar situation
to the issues [1] that the KSM work [2] fell into with trying to prevent
duplication of pages.

cheers,
Chris

[1] - http://lwn.net/Articles/309155/
[2] - http://lwn.net/Articles/306704/

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP

Quoting Oliver Mattos <oliver.mattos08@imperial.ac.uk> on Thu, Dec 11 00:18:
> > It would be interesting to see how many duplicate *blocks* there are
> > across the filesystem, agnostic to files...

Here is my contribution. It's a perl script that goes through every
block (of various block sizes) on a device and prints out summaries. It
uses BerkeleyDB to store the hashes as it goes, as there are an enormous
number for smaller block sizes.

It checks every block, not just the used blocks, so on a filesystem less
than 100% full it really isn't a proper test. It requires root
privileges in order to read the raw block device.

My root filesystem on my Debian box, 20G, 9.2G used:

root:   512 byte blocks  27.07% duplicate.
root:  1024 byte blocks  26.20% duplicate.
root:  4096 byte blocks  23.17% duplicate.
root:  8192 byte blocks  15.31% duplicate.
root: 16384 byte blocks   9.57% duplicate.

For my home partition, 20G, 29% used:

home:   512 byte blocks  32.11% duplicate.
home:  1024 byte blocks  31.16% duplicate.
home:  4096 byte blocks  29.85% duplicate.
home:  8192 byte blocks  21.31% duplicate.
home: 16384 byte blocks  10.84% duplicate.

There is, of course, a fair amount of overhead for smaller blocks:

512b  3.91% overhead
1k    1.95% overhead
4k    0.49% overhead
8k    0.24% overhead
16k   0.12% overhead

Though, 4% overhead for 27-32% savings seems pretty impressive. Again,
it checksums all blocks, not just used blocks, so the whole test might
be bogus.

--
Life is wasted on the living.

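The quoted overhead figures are consistent with storing one 20-byte
(SHA-1-sized) digest per block - an inference from the numbers, not
something stated in the post:

    overhead ~= digest size / block size, for a 20-byte digest:
    20/512  = 3.91%    20/1024  = 1.95%    20/4096 = 0.49%
    20/8192 = 0.24%    20/16384 = 0.12%
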