I've googled this for a bit, but can't seem to find the answer.

What does compression bring to the party that dedupe doesn't cover already?

Thank you for your patience and answers.
-- 
This message posted from opensolaris.org
Dedup came much later than compression. Also, compression saves space and therefore load time even when there's only one copy. It is especially good for e.g. HTML or man page documentation, which tends to compress very well (versus binary formats like images or MP3s that don't). It gives me an extra, say, 10 GB on my laptop's 80 GB SSD, which isn't bad.

Alex

Sent from my (new) iPhone

On 6 May 2010, at 02:06, Richard Jahnel <richard at ellipseinc.com> wrote:

> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?
>
> Thank you for your patience and answers.
> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?
>
> Thank you for your patience and answers.

That almost sounds like a classroom question.

Pick a simple example: large text files, each of which is unique, maybe lines of data or something. There's not likely to be much in the way of duplicate blocks to share, but the files are very likely to be highly compressible. Contrast that with binary files, which might have blocks of zero bytes in them (without being strictly sparse, sometimes). With deduping, one such block is all that's actually stored (along with all the references to it, of course).

In the 30 seconds or so I've been thinking about it to type this, I would _guess_ that one might want one or the other, but rarely both, since compression might tend to work against deduplication. So given the availability of both, and how lightweight zfs filesystems are, one might want to create separate filesystems within a pool, with one or the other enabled as appropriate, and separate the data according to which would likely work better on it. Also, one might as well put compressed video, audio, and image formats in a filesystem that was _not_ compressed, since compressing an already compressed file seldom gains much, if anything, more.
-- 
This message posted from opensolaris.org
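To make the separate-filesystem idea above concrete, per-dataset properties can be set at creation time. A minimal sketch, assuming a pool named tank and made-up dataset names:

# text-heavy data that should compress well
$ zfs create -o compression=on tank/docs

# already-compressed media (video, audio, images): leave compression off
$ zfs create -o compression=off tank/media

# confirm what was set
$ zfs get compression tank/docs tank/media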
Another thought is this: _unless_ the CPU is the bottleneck on a particular system, compression (_when_ it actually helps) can speed up overall operation, by reducing the amount of I/O needed. But storing already-compressed files in a filesystem with compression is likely to result in wasted effort, with little or no gain to show for it.

Even deduplication requires some extra effort. Looking at the documentation, it implies a particular checksum algorithm _plus_ verification (if the checksum or digest matches, then make sure by doing a byte-for-byte compare of the blocks, since nothing shorter than the data itself can _guarantee_ that they're the same, just as no lossless compression can possibly work for all possible bitstreams).

So doing either of these where the success rate is likely to be too low is probably not helpful.

There are stats that show the savings for a filesystem due to compression or deduplication. What I think would be interesting is some advice as to how much savings (as a percentage) one should be getting in order to come out ahead not just on storage, but on overall system performance. Of course, no such guidance would exactly fit any particular workload, but I think one might be able to come up with some approximate numbers, or at least a range, below which those features probably represent a waste of effort unless space is at an absolute premium.
-- 
This message posted from opensolaris.org
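The per-filesystem savings statistics mentioned above are exposed as properties; roughly, and again assuming a pool named tank with an example dataset:

# compression savings for one dataset
$ zfs get compressratio tank/docs

# pool-wide dedup savings (DEDUP column / dedupratio property)
$ zpool list tank
$ zpool get dedupratio tank

# zdb -S can estimate the dedup ratio for a pool before dedup is enabled
$ zdb -S tank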
One of the big things to remember with dedup is that it is block-oriented (as is compression) - it deals with things in discrete chunks, (usually) not the entire file as a stream.

So, let's do a thought-experiment here:

File A is 100MB in size. From ZFS's standpoint, let's say it's made up of 100 1MB blocks (or chunks, or slabs). Let's also say that none of the blocks are identical (which is highly likely) - that is, no block checksums identically. Thus, with dedup on, this file takes up 100MB of space. If I do a "cp fileA fileB", no additional space will be taken up.

However, let's say I then add 1 bit of data to the very front of file A. Now, block alignments have changed for the entire file, so all the 1MB blocks checksum differently. Thus, in this case, adding 1 bit of data to file A actually causes 100MB+1 bit of new data to be used, as now none of file B's blocks are the same as file A's. Therefore, after 1 additional bit has been written, total disk usage is 200MB+1 bit.

If compression were being used, file A originally would likely take up < 100MB, and file B would take up the same amount; thus, the two together could take up, say, 150MB (with a conservative 25% compression ratio). After writing 1 new bit to file A, file A almost certainly compresses the same as before, so the two files will continue to occupy 150MB of space.

Compression is not obsoleted by dedup. They both have their places, depending on the data being stored, and the usage pattern of that data.

-Erik

On Wed, 2010-05-05 at 19:11 -0700, Richard L. Hamilton wrote:
> Another thought is this: _unless_ the CPU is the bottleneck on
> a particular system, compression (_when_ it actually helps) can
> speed up overall operation, by reducing the amount of I/O needed.
> But storing already-compressed files in a filesystem with compression
> is likely to result in wasted effort, with little or no gain to show for it.
> [...]

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
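The alignment effect in this thought-experiment can be illustrated outside ZFS by checksumming a file in fixed-size chunks. A rough sketch (file names are hypothetical, 128K is chosen arbitrarily as the chunk size, and split -b / digest are as found on Solaris):

# checksum fileA in 128K chunks
$ split -b 131072 fileA a_
$ for b in a_*; do digest -a sha256 $b; done > sums.before

# prepend a single byte, then checksum the shifted copy the same way
$ { printf 'x'; cat fileA; } > fileA.shifted
$ split -b 131072 fileA.shifted b_
$ for b in b_*; do digest -a sha256 $b; done > sums.after

# after the one-byte shift, essentially no chunk checksums match
$ diff sums.before sums.after

The same applies to ZFS records: shift the data by a byte and every block checksum downstream changes, so none of those blocks dedup against the unshifted copy.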
Hmm... To clarify.

Every discussion or benchmark that I have seen always shows both off, compression only, or both on.

Why never compression off and dedup on?

After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question. Some confirmation would be nice, though.
-- 
This message posted from opensolaris.org
On 05/ 6/10 03:35 PM, Richard Jahnel wrote:
> Hmm... To clarify.
>
> Every discussion or benchmark that I have seen always shows both off, compression only, or both on.
>
> Why never compression off and dedup on?
>
> After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question.

Data that doesn't compress well also tends to be data that doesn't dedup well (media files, for example).

-- 
Ian.
On May 5, 2010, at 8:35 PM, Richard Jahnel wrote:
> Hmm... To clarify.
>
> Every discussion or benchmark that I have seen always shows both off, compression only, or both on.
>
> Why never compression off and dedup on?

I've seen this quite often. The decision to compress is based on the compressibility of the data. The decision to dedup is based on the duplication of the data.

> After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question.

Both work at the block level. Hence, they are complementary. Two identical blocks will compress identically, and then dedup.
 -- richard

-- 
ZFS storage and performance consulting at http://www.RichardElling.com
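Since both properties are set per dataset and operate on blocks, any combination can be enabled; a sketch with made-up dataset names:

# compression only
$ zfs create -o compression=on tank/a

# dedup only
$ zfs create -o dedup=on tank/b

# both: blocks are compressed first, and identical compressed blocks then dedup
$ zfs create -o compression=on -o dedup=on tank/c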
On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel <richard at ellipseinc.com> wrote:
> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?

Compression will reduce the storage requirements for non-duplicate data.

As an example, I have a system that I rsync the web application data from a whole bunch of servers (zones) to. There's a fair amount of duplication in the application files (java, tomcat, apache, and the like), so dedup is a big win. On the other hand, there's essentially no duplication whatsoever in the log files, which are pretty big, but compress really well. So having both enabled works really well.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
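A layout along the lines Peter describes might look like this (pool and dataset names are hypothetical, and gzip for the logs is just one plausible choice):

# application trees rsynced from many zones: lots of identical files -> dedup
$ zfs create -o dedup=on -o compression=off tank/apps

# log files: unique but highly compressible -> compression
$ zfs create -o dedup=off -o compression=gzip tank/logs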
This is interesting, but what about iSCSI volumes for virtual machines?

Compress or de-dupe? Assuming the virtual machine was made from a clone of the original iSCSI or a master iSCSI volume.

Does anyone have any real-world data on this? I would think the iSCSI volumes would diverge quite a bit over time, even with compression and/or de-duplication.

Just curious.

On 6 May 2010, at 16:39 , Peter Tribble wrote:
> On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel <richard at ellipseinc.com> wrote:
>> I've googled this for a bit, but can't seem to find the answer.
>>
>> What does compression bring to the party that dedupe doesn't cover already?
>
> Compression will reduce the storage requirements for non-duplicate data.
>
> As an example, I have a system that I rsync the web application data from a whole
> bunch of servers (zones) to. There's a fair amount of duplication in the application
> files (java, tomcat, apache, and the like), so dedup is a big win. On the other hand,
> there's essentially no duplication whatsoever in the log files, which are pretty big,
> but compress really well. So having both enabled works really well.
>
> --
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
On Fri, 2010-05-07 at 03:10 +0900, Michael Sullivan wrote:
> This is interesting, but what about iSCSI volumes for virtual machines?
>
> Compress or de-dupe? Assuming the virtual machine was made from a clone of the original iSCSI or a master iSCSI volume.
>
> Does anyone have any real-world data on this? I would think the iSCSI volumes would diverge quite a bit over time, even with compression and/or de-duplication.
>
> Just curious.

VM OS storage is an ideal candidate for dedup, and NOT compression (for the most part). VM images contain large quantities of executable files, most of which compress poorly, if at all. However, having 20 copies of the same Windows 2003 VM image makes for very nice dedup.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
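For iSCSI-backed VM storage, the dedup property can be set on the zvol that backs the LUN. A minimal sketch (pool name, volume name, and size are made up):

# a 20 GB volume for a VM image, dedup on, compression left off
$ zfs create -V 20g -o dedup=on -o compression=off tank/vm01

# once several similar VMs exist, the pool-wide savings show up here
$ zpool get dedupratio tank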
On 06/05/2010 21:07, Erik Trimble wrote:
> VM images contain large quantities of executable files, most of which
> compress poorly, if at all.

What data are you basing that generalisation on?

Look at these simple examples for libc on my OpenSolaris machine:

1.6M /usr/lib/libc.so.1*
636K /tmp/libc.gz

I did the same thing for vim and got pretty much the same result.

It will be different (probably not quite as good) when it is at the ZFS block level rather than the whole file, but those two samples, chosen at random, say otherwise to your generalisation.

Several people have also found that enabling ZFS compression on their root datasets is worthwhile - and those are pretty much the same kind of content as a VM image of Solaris.

Remember also that unless you are very CPU bound you might actually improve performance by enabling compression. This isn't new to ZFS; people (myself included) used to do this back in MS-DOS days with Stacker and DoubleSpace.

Also, OS images these days have lots of configuration files, which tend to be text-based formats, and those compress very well.

-- 
Darren J Moffat
>On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
>What data are you basing that generalisation on?
>
>Look at these simple examples for libc on my OpenSolaris machine:
>
>1.6M /usr/lib/libc.so.1*
>636K /tmp/libc.gz
>
>I did the same thing for vim and got pretty much the same result.
>
>It will be different (probably not quite as good) when it is at the ZFS
>block level rather than the whole file, but those two samples, chosen at
>random, say otherwise to your generalisation.

Easy to test when "compression" is enabled for your rpool:

2191 -rwxr-xr-x   1 root     bin      1794552 May  6 14:46 /usr/lib/libc.so.1*

(The actual size is 3500 blocks, so we're saving quite a bit.)

Casper
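The allocated-versus-logical comparison Casper shows can be repeated on any file in a compressed dataset, or for a dataset as a whole (the dataset name here is just an example and will vary per system):

# logical file size vs. kilobytes actually allocated on disk
$ ls -l /usr/lib/libc.so.1
$ du -k /usr/lib/libc.so.1

# dataset-wide view
$ zfs get compressratio,used,referenced rpool/ROOT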
On Fri, May 7, 2010 04:32, Darren J Moffat wrote:
> Remember also that unless you are very CPU bound you might actually
> improve performance by enabling compression. This isn't new to ZFS;
> people (myself included) used to do this back in MS-DOS days with
> Stacker and DoubleSpace.

CPU has been "cheaper" in many circumstances than I/O for quite a while. Gray and Putzolu formulated the "five-minute rule" back in 1987:

> The 5-minute random rule: cache randomly accessed disk pages that are
> re-used every 5 minutes.

http://en.wikipedia.org/wiki/Five-minute_rule

They revisited it in 1997 and 2007, and it still holds:

> The 20-year-old five-minute rule for RAM and disks still holds, but for
> ever-larger disk pages. Moreover, it should be augmented by two new
> five-minute rules: one for small pages moving between RAM and flash
> memory and one for large pages moving between flash memory and
> traditional disks.

http://tinyurl.com/m9hrv4
http://cacm.acm.org/magazines/2009/7/32091-the-five-minute-rule-20-years-later/fulltext

Avoiding (disk) I/O has been desirable for quite a while now. Someone in comp.arch (Terje Mathisen?) used to have the following in his signature: "almost all programming can be viewed as an exercise in caching."
> On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
> What data are you basing that generalisation on ?

note : I can't believe someone said that.

warning : I just detected a fast rise time on my pedantic input line and I am in full geek herd mode :

http://www.blastwave.org/dclarke/blog/?q=node/160

The degree to which a file can be compressed is often related to the degree of randomness or "entropy" in the bit sequences in that file. We tend to look at files in chunks of bits called "bytes" or "words" or "blocks" of some given length, but the harsh reality is that a file is just a sequence of ones and zeros and nothing more. However, I can spot blocks or patterns in there and then create tokens that represent repeating blocks. If you want a really random file that you are certain has nearly perfect high entropy, then just get a coin and flip it 1024 times while recording the heads and tails results. Then input that data into a file as a sequence of ones and zeros and you have a very neatly random chunk of data.

Good luck trying to compress that thing.

Pardon me .. here it comes. I spent way too many years in labs doing work with RNG hardware and software to just look the other way. And I'm in a good mood.

Suppose that C is some discrete random variable. That means that C can have well defined values like HEAD or TAIL. You usually have a bunch ( n of them ) of possible values x1, x2, x3, ..., xn that C can be. Each of those shows up in the data set with specific probabilities p1, p2, p3, ..., pn, where the sum of those adds to exactly one. This means that x1 will appear in the dataset with an "expected" probability of p1. All of those probabilities are expressed as a value between 0 and 1. A value of 1 means "certainty". Okay, so in the case of a coin ( not the one in Batman: The Dark Knight ) you have x1=TAIL and x2=HEAD with ( we hope ) p1=0.5=p2 such that p1+p2 = 1 exactly, unless the coin lands on its edge and the universe collapses due to entropy implosion. That is a joke. I used to teach this as a TA in university, so bear with me.

So go flip a coin a few thousand times and you will get fairly random data. That is a Random Number Generator that you have, and it's always kicking around your lab or in your pocket or on the street. Pretty cheap, but the baud rate is hellishly low.

If you get tired of flipping bits using a coin then you may have to just give up on that ( or buy a radioactive source where you can monitor particles emitted as it decays for input data ) OR be really cheap and look at /dev/urandom on a decent Solaris machine :

$ ls -lap /dev/urandom
lrwxrwxrwx 1 root root 34 Jul  3  2008 /dev/urandom -> ../devices/pseudo/random@0:urandom

That thing right there is a pseudo random number generator. It will make for really random data, but there is no promise that over a given number of bits the sum p1 + p2 will be precisely 1. It will be real, real close, however, to a very random ( high entropy ) data source.

Need 1024 bits of random data ?
$ /usr/xpg4/bin/od -Ax -N 128 -t x1 /dev/urandom
0000000 ef c6 2b ba 29 eb dd ec 6d 73 36 06 58 33 c8 be
0000010 53 fa 90 a2 a2 70 25 5f 67 1b c3 72 4f 26 c6 54
0000020 e9 83 44 c6 b9 45 3f 88 25 0c 4d c7 bc d5 77 58
0000030 d3 94 8e 4e e1 dd 71 02 dc c2 d0 19 f6 f4 5c 44
0000040 ff 84 56 9f 29 2a e5 00 33 d2 10 a4 d2 8a 13 56
0000050 d1 ac 86 46 4d 1e 2f 10 d9 0b 33 d7 c2 d4 ef df
0000060 d9 a2 0b 7f 24 05 72 39 2d a6 75 25 01 bd 41 6c
0000070 eb d9 4f 23 d9 ee 05 67 61 7c 8a 3d 5f 3a 76 e3
0000080

There ya go. That was faster than flipping a coin, eh? ( my Canadian bit just flipped )

So you were saying ( or someone somewhere had the crazy idea ) that ZFS with dedupe and compression enabled won't really be of great benefit because of all the binary files in the filesystem. Well, that's just nuts. Sorry, but it is. Those binary files are made up of ELF headers and opcodes from a specific set of opcodes for a given architecture, and that means the input set C consists of a "discrete set of possible values" and NOT pure random high entropy data.

Want a demo ? Here :

(1) take a nice big lib

$ uname -a
SunOS aequitas 5.11 snv_138 i86pc i386 i86pc
$ ls -lap /usr/lib | awk '{ print $5 " " $9 }' | sort -n | tail
4784548 libwx_gtk2u_core-2.8.so.0.6.0
4907156 libgtkmm-2.4.so.1.1.0
6403701 llib-lX11.ln
8939956 libicudata.so.2
9031420 libgs.so.8.64
9300228 libCg.so
9916268 libicudata.so.3
14046812 libicudata.so.40.1
21747700 libmlib.so.2
40736972 libwireshark.so.0.0.1

$ cp /usr/lib/libwireshark.so.0.0.1 /tmp
$ ls -l /tmp/libwireshark.so.0.0.1
-r-xr-xr-x 1 dclarke csw 40736972 May 7 14:20 /tmp/libwireshark.so.0.0.1

What is the SHA256 hash for that file ?

$ cd /tmp

Now compress it with gzip ( a good test case ) :

$ /opt/csw/bin/gzip -9v libwireshark.so.0.0.1
libwireshark.so.0.0.1: 76.1% -- replaced with libwireshark.so.0.0.1.gz

$ ls -l libwireshark.so.0.0.1.gz
-r-xr-xr-x 1 dclarke csw 9754053 May 7 14:20 libwireshark.so.0.0.1.gz

$ bc
scale=9
9754053/40736972
0.239439814

I see compression there.

Let's see what happens with really random data :

$ dd if=/dev/urandom of=/tmp/foo.dat bs=8192 count=8192
8192+0 records in
8192+0 records out
$ ls -l /tmp/foo.dat
-rw-r--r-- 1 dclarke csw 67108864 May 7 15:21 /tmp/foo.dat

$ ls -l /tmp/foo.dat.gz
-rw-r--r-- 1 dclarke csw 67119130 May 7 15:21 /tmp/foo.dat.gz

QED.
Dennis is correct, in that compressibility is inversely related to randomness. And, also, that binaries have nice commonality of symbols and headers. All of which goes to excellent DEDUP. But not necessarily real good compression - since what we need for compression is duplication /inside/ each file.

However, VM images aren't all binaries (typical OS images have text files and pre-compressed stuff all over the place, e.g. man pages, Windows .cab files, text config files, etc., plus the standard binary executables). So, let's see what a typical example brings us...

As a sample, I just compressed a basic installation of Windows 2000 (C:\WINNT). Here's what I found:

Using a standard LZ-style encoding tool (DEFLATE, i.e. that used in gzip, ZIP, PDF, etc.), I get about a 40% compression (i.e. ~1.6 ratio). 1.8GB -> 858MB

Using ZFS filesystem compression of the same data, I get a 1.58x compression ratio, 2.2GB -> 1.4GB

A ZFS filesystem with dedup on, using the same WINNT data, produces a 1.86x dedup factor. (2.15GB "used", for 1.17GB allocated)

Finally, a ZFS filesystem with compression & dedup turned on: compressratio = 1.66x, dedupratio = 1.77x, 1.27GB data stored, 741MB allocated (remember, this is for a 2.2GB data set, so about 3.0x total)

I also just did a quick rsync of my / partition on an Ubuntu 10.04 x64 system - I'm getting about a 1.5x compressratio for that stuff (or, less than 35% compression). 6.3GB -> 4.2GB

So, yes, I should have been more exact, in that system binaries /DO/ compress. However, they're not excessively compressible, and in a comparison between a fs with compression and one with dedup, dedup wins hands down, even if I'm not storing lots of identical images.

I'm still not convinced that compression is really worthwhile for OS images, but I'm open to talk... :-)

Also, here's the time output for each of the unzip operations (I'm doing this in a VirtualBox instance, so it's a bit slow):

(no dedup, no compression)
real    1m41.865s
user    0m23.270s
sys     0m26.400s

(no dedup, compression)
real    1m40.465s
user    0m23.088s
sys     0m25.190s

(no compression, dedup)
real    2m1.400s
user    0m23.162s
sys     0m27.639s

(dedup & compression)
real    1m51.122s
user    0m23.294s
sys     0m26.953s

I'll also be a bit more careful with the broad generalizations. :-)

-Erik

On Fri, 2010-05-07 at 11:23 -0400, Dennis Clarke wrote:
> > On 06/05/2010 21:07, Erik Trimble wrote:
> >> VM images contain large quantities of executable files, most of which
> >> compress poorly, if at all.
> >
> > What data are you basing that generalization on ?
>
> note : I can't believe someone said that.
>
> warning : I just detected a fast rise time on my pedantic input line and I
> am in full geek herd mode :
> http://www.blastwave.org/dclarke/blog/?q=node/160
> [...]
-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)