I've googled this for a bit, but can't seem to find the answer.

What does compression bring to the party that dedupe doesn't cover already?

Thank you for your patience and answers.
-- 
This message posted from opensolaris.org
Dedup came much later than compression. Also, compression saves space and therefore load time even when there's only one copy. It is especially good for e.g. HTML or man page documentation, which tends to compress very well (versus binary formats like images or MP3s that don't). It gives me an extra, say, 10 GB on my laptop's 80 GB SSD, which isn't bad.

Alex

Sent from my (new) iPhone

On 6 May 2010, at 02:06, Richard Jahnel <richard at ellipseinc.com> wrote:

> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?
>
> Thank you for your patience and answers.
> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?
>
> Thank you for your patience and answers.

That almost sounds like a classroom question.

Pick a simple example: large text files, each of which is unique, maybe lines of data or something. There's not likely to be much in the way of duplicate blocks to share, but the files are very likely to be highly compressible. Contrast that with binary files, which might have blocks of zero bytes in them (without being strictly sparse, sometimes). With deduping, one such block is all that's actually stored (along with all the references to it, of course).

In the 30 seconds or so I've been thinking about it to type this, I would _guess_ that one might want one or the other, but rarely both, since compression might tend to work against deduplication. So given the availability of both, and how lightweight zfs filesystems are, one might want to create separate filesystems within a pool, with one or the other enabled as appropriate, and separate the data according to which would likely work better on it. Also, one might as well put compressed video, audio, and image formats in a filesystem that was _not_ compressed, since compressing an already compressed file seldom gains much, if anything, more.
-- 
This message posted from opensolaris.org
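To make the separate-filesystem idea above concrete, per-dataset properties can be set at creation time. A minimal sketch, assuming a pool named tank and made-up dataset names:

# text-heavy data that should compress well
$ zfs create -o compression=on tank/docs

# already-compressed media (video, audio, images): leave compression off
$ zfs create -o compression=off tank/media

# confirm what was set
$ zfs get compression tank/docs tank/media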
Another thought is this: _unless_ the CPU is the bottleneck on a particular system, compression (_when_ it actually helps) can speed up overall operation, by reducing the amount of I/O needed. But storing already-compressed files in a filesystem with compression is likely to result in wasted effort, with little or no gain to show for it.

Even deduplication requires some extra effort. Looking at the documentation, it implies a particular checksum algorithm _plus_ verification (if the checksum or digest matches, then make sure by doing a byte-for-byte compare of the blocks, since nothing shorter than the data itself can _guarantee_ that they're the same, just as no lossless compression can possibly work for all possible bitstreams).

So doing either of these where the success rate is likely to be too low is probably not helpful.

There are stats that show the savings for a filesystem due to compression or deduplication. What I think would be interesting is some advice as to how much savings (as a percentage) one should be getting in order to come out ahead not just on storage, but on overall system performance. Of course, no such guidance would exactly fit any particular workload, but I think one might be able to come up with some approximate numbers, or at least a range, below which those features probably represent a waste of effort unless space is at an absolute premium.
-- 
This message posted from opensolaris.org
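The per-filesystem savings statistics mentioned above are exposed as properties; roughly, and again assuming a pool named tank with an example dataset:

# compression savings for one dataset
$ zfs get compressratio tank/docs

# pool-wide dedup savings (DEDUP column / dedupratio property)
$ zpool list tank
$ zpool get dedupratio tank

# zdb -S can estimate the dedup ratio for a pool before dedup is enabled
$ zdb -S tank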
One of the big things to remember with dedup is that it is block-oriented (as is compression) - it deals with things in discrete chunks, (usually) not the entire file as a stream.

So, let's do a thought-experiment here:

File A is 100MB in size. From ZFS's standpoint, let's say it's made up of 100 1MB blocks (or chunks, or slabs). Let's also say that none of the blocks are identical (which is highly likely) - that is, no block checksums identically. Thus, with dedup on, this file takes up 100MB of space. If I do a "cp fileA fileB", no additional space will be taken up.

However, let's say I then add 1 bit of data to the very front of file A. Now, block alignments have changed for the entire file, so all the 1MB blocks checksum differently. Thus, in this case, adding 1 bit of data to file A actually causes 100MB+1 bit of new data to be used, as now none of file B's blocks are the same as file A's. Therefore, after 1 additional bit has been written, total disk usage is 200MB+1 bit.

If compression were being used, file A originally would likely take up < 100MB, and file B would take up the same amount; thus, the two together could take up, say, 150MB (with a conservative 25% compression ratio). After writing 1 new bit to file A, file A almost certainly compresses the same as before, so the two files will continue to occupy 150MB of space.

Compression is not obsoleted by dedup. They both have their places, depending on the data being stored, and the usage pattern of that data.

-Erik

On Wed, 2010-05-05 at 19:11 -0700, Richard L. Hamilton wrote:
> Another thought is this: _unless_ the CPU is the bottleneck on
> a particular system, compression (_when_ it actually helps) can
> speed up overall operation, by reducing the amount of I/O needed.
> But storing already-compressed files in a filesystem with compression
> is likely to result in wasted effort, with little or no gain to show for it.
> [...]

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
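The alignment effect in this thought-experiment can be illustrated outside ZFS by checksumming a file in fixed-size chunks. A rough sketch (file names are hypothetical, 128K is chosen arbitrarily as the chunk size, and split -b / digest are as found on Solaris):

# checksum fileA in 128K chunks
$ split -b 131072 fileA a_
$ for b in a_*; do digest -a sha256 $b; done > sums.before

# prepend a single byte, then checksum the shifted copy the same way
$ { printf 'x'; cat fileA; } > fileA.shifted
$ split -b 131072 fileA.shifted b_
$ for b in b_*; do digest -a sha256 $b; done > sums.after

# after the one-byte shift, essentially no chunk checksums match
$ diff sums.before sums.after

The same applies to ZFS records: shift the data by a byte and every block checksum downstream changes, so none of those blocks dedup against the unshifted copy.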
Hmm... To clarify.

Every discussion or benchmark that I have seen always shows both off, compression only, or both on.

Why never compression off and dedup on?

After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question. Some confirmation would be nice, though.
-- 
This message posted from opensolaris.org
On 05/ 6/10 03:35 PM, Richard Jahnel wrote:
> Hmm... To clarify.
>
> Every discussion or benchmark that I have seen always shows both off, compression only, or both on.
>
> Why never compression off and dedup on?
>
> After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question.

Data that doesn't compress well also tends to be data that doesn't dedup well (media files, for example).

-- 
Ian.
On May 5, 2010, at 8:35 PM, Richard Jahnel wrote:
> Hmm... To clarify.
>
> Every discussion or benchmark that I have seen always shows both off, compression only, or both on.
>
> Why never compression off and dedup on?

I've seen this quite often. The decision to compress is based on the compressibility of the data. The decision to dedup is based on the duplication of the data.

> After some further thought... perhaps it's because compression works at the byte level and dedup is at the block level. Perhaps I have answered my own question.

Both work at the block level. Hence, they are complementary. Two identical blocks will compress identically, and then dedup.
 -- richard

-- 
ZFS storage and performance consulting at http://www.RichardElling.com
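Since both properties are set per dataset and operate on blocks, any combination can be enabled; a sketch with made-up dataset names:

# compression only
$ zfs create -o compression=on tank/a

# dedup only
$ zfs create -o dedup=on tank/b

# both: blocks are compressed first, and identical compressed blocks then dedup
$ zfs create -o compression=on -o dedup=on tank/c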
On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel <richard at ellipseinc.com> wrote:
> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?

Compression will reduce the storage requirements for non-duplicate data.

As an example, I have a system that I rsync the web application data from a whole bunch of servers (zones) to. There's a fair amount of duplication in the application files (java, tomcat, apache, and the like), so dedup is a big win. On the other hand, there's essentially no duplication whatsoever in the log files, which are pretty big, but compress really well. So having both enabled works really well.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
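A layout along the lines Peter describes might look like this (pool and dataset names are hypothetical, and gzip for the logs is just one plausible choice):

# application trees rsynced from many zones: lots of identical files -> dedup
$ zfs create -o dedup=on -o compression=off tank/apps

# log files: unique but highly compressible -> compression
$ zfs create -o dedup=off -o compression=gzip tank/logs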
This is interesting, but what about iSCSI volumes for virtual machines?

Compress or de-dupe? Assuming the virtual machine was made from a clone of the original iSCSI or a master iSCSI volume.

Does anyone have any real-world data on this? I would think the iSCSI volumes would diverge quite a bit over time, even with compression and/or de-duplication.

Just curious.

On 6 May 2010, at 16:39 , Peter Tribble wrote:
> On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel <richard at ellipseinc.com> wrote:
>> I've googled this for a bit, but can't seem to find the answer.
>>
>> What does compression bring to the party that dedupe doesn't cover already?
>
> Compression will reduce the storage requirements for non-duplicate data.
>
> As an example, I have a system that I rsync the web application data from a whole
> bunch of servers (zones) to. There's a fair amount of duplication in the application
> files (java, tomcat, apache, and the like), so dedup is a big win. On the other hand,
> there's essentially no duplication whatsoever in the log files, which are pretty big,
> but compress really well. So having both enabled works really well.
>
> --
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
On Fri, 2010-05-07 at 03:10 +0900, Michael Sullivan wrote:
> This is interesting, but what about iSCSI volumes for virtual machines?
>
> Compress or de-dupe? Assuming the virtual machine was made from a clone of the original iSCSI or a master iSCSI volume.
>
> Does anyone have any real-world data on this? I would think the iSCSI volumes would diverge quite a bit over time, even with compression and/or de-duplication.
>
> Just curious.

VM OS storage is an ideal candidate for dedup, and NOT compression (for the most part). VM images contain large quantities of executable files, most of which compress poorly, if at all. However, having 20 copies of the same Windows 2003 VM image makes for very nice dedup.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
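For iSCSI-backed VM storage, the dedup property can be set on the zvol that backs the LUN. A minimal sketch (pool name, volume name, and size are made up):

# a 20 GB volume for a VM image, dedup on, compression left off
$ zfs create -V 20g -o dedup=on -o compression=off tank/vm01

# once several similar VMs exist, the pool-wide savings show up here
$ zpool get dedupratio tank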
On 06/05/2010 21:07, Erik Trimble wrote:
> VM images contain large quantities of executable files, most of which
> compress poorly, if at all.

What data are you basing that generalisation on?

Look at these simple examples for libc on my OpenSolaris machine:

1.6M /usr/lib/libc.so.1*
636K /tmp/libc.gz

I did the same thing for vim and got pretty much the same result.

It will be different (probably not quite as good) when it is at the ZFS block level rather than the whole file, but those two samples, chosen at random, say otherwise to your generalisation.

Several people have also found that enabling ZFS compression on their root datasets is worthwhile - and those are pretty much the same kind of content as a VM image of Solaris.

Remember also that unless you are very CPU bound you might actually improve performance by enabling compression. This isn't new to ZFS; people (myself included) used to do this back in MS-DOS days with Stacker and DoubleSpace.

Also, OS images these days have lots of configuration files, which tend to be text-based formats, and those compress very well.

-- 
Darren J Moffat
>On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
>What data are you basing that generalisation on?
>
>Look at these simple examples for libc on my OpenSolaris machine:
>
>1.6M /usr/lib/libc.so.1*
>636K /tmp/libc.gz
>
>I did the same thing for vim and got pretty much the same result.
>
>It will be different (probably not quite as good) when it is at the ZFS
>block level rather than the whole file, but those two samples, chosen at
>random, say otherwise to your generalisation.

Easy to test when "compression" is enabled for your rpool:

2191 -rwxr-xr-x   1 root     bin      1794552 May  6 14:46 /usr/lib/libc.so.1*

(The actual size is 3500 blocks, so we're saving quite a bit.)

Casper
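The allocated-versus-logical comparison Casper shows can be repeated on any file in a compressed dataset, or for a dataset as a whole (the dataset name here is just an example and will vary per system):

# logical file size vs. kilobytes actually allocated on disk
$ ls -l /usr/lib/libc.so.1
$ du -k /usr/lib/libc.so.1

# dataset-wide view
$ zfs get compressratio,used,referenced rpool/ROOT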
On Fri, May 7, 2010 04:32, Darren J Moffat wrote:
> Remember also that unless you are very CPU bound you might actually
> improve performance by enabling compression. This isn't new to ZFS;
> people (myself included) used to do this back in MS-DOS days with
> Stacker and DoubleSpace.

CPU has been "cheaper" in many circumstances than I/O for quite a while. Gray and Putzolu formulated the "five-minute rule" back in 1987:

> The 5-minute random rule: cache randomly accessed disk pages that are
> re-used every 5 minutes.

http://en.wikipedia.org/wiki/Five-minute_rule

They revisited it in 1997 and 2007, and it still holds:

> The 20-year-old five-minute rule for RAM and disks still holds, but for
> ever-larger disk pages. Moreover, it should be augmented by two new
> five-minute rules: one for small pages moving between RAM and flash
> memory and one for large pages moving between flash memory and
> traditional disks.

http://tinyurl.com/m9hrv4
http://cacm.acm.org/magazines/2009/7/32091-the-five-minute-rule-20-years-later/fulltext

Avoiding (disk) I/O has been desirable for quite a while now. Someone in comp.arch (Terje Mathisen?) used to have the following in his signature: "almost all programming can be viewed as an exercise in caching."
> On 06/05/2010 21:07, Erik Trimble wrote:
>> VM images contain large quantities of executable files, most of which
>> compress poorly, if at all.
>
> What data are you basing that generalisation on ?

note : I can't believe someone said that.

warning : I just detected a fast rise time on my pedantic input line and I am in full geek herd mode :

http://www.blastwave.org/dclarke/blog/?q=node/160

The degree to which a file can be compressed is often related to the degree of randomness or "entropy" in the bit sequences in that file. We tend to look at files in chunks of bits called "bytes" or "words" or "blocks" of some given length, but the harsh reality is that a file is just a sequence of ones and zeros and nothing more. However, I can spot blocks or patterns in there and then create tokens that represent repeating blocks. If you want a really random file that you are certain has nearly perfect high entropy, then just get a coin and flip it 1024 times while recording the heads and tails results. Then input that data into a file as a sequence of ones and zeros and you have a very neatly random chunk of data.

Good luck trying to compress that thing.

Pardon me .. here it comes. I spent way too many years in labs doing work with RNG hardware and software to just look the other way. And I'm in a good mood.

Suppose that C is some discrete random variable. That means that C can have well defined values like HEAD or TAIL. You usually have a bunch ( n of them ) of possible values x1, x2, x3, ..., xn that C can be. Each of those shows up in the data set with specific probabilities p1, p2, p3, ..., pn, where the sum of those adds to exactly one. This means that x1 will appear in the dataset with an "expected" probability of p1. All of those probabilities are expressed as a value between 0 and 1. A value of 1 means "certainty". Okay, so in the case of a coin ( not the one in Batman: The Dark Knight ) you have x1=TAIL and x2=HEAD with ( we hope ) p1=0.5=p2 such that p1+p2 = 1 exactly, unless the coin lands on its edge and the universe collapses due to entropy implosion. That is a joke. I used to teach this as a TA in university, so bear with me.

So go flip a coin a few thousand times and you will get fairly random data. That is a Random Number Generator that you have, and it's always kicking around your lab or in your pocket or on the street. Pretty cheap, but the baud rate is hellishly low.

If you get tired of flipping bits using a coin then you may have to just give up on that ( or buy a radioactive source where you can monitor particles emitted as it decays for input data ) OR be really cheap and look at /dev/urandom on a decent Solaris machine :

$ ls -lap /dev/urandom
lrwxrwxrwx 1 root root 34 Jul  3  2008 /dev/urandom -> ../devices/pseudo/random@0:urandom

That thing right there is a pseudo random number generator. It will make for really random data, but there is no promise that over a given number of bits the sum p1 + p2 will be precisely 1. It will be real, real close, however, to a very random ( high entropy ) data source.

Need 1024 bits of random data ?
$ /usr/xpg4/bin/od -Ax -N 128 -t x1 /dev/urandom
0000000 ef c6 2b ba 29 eb dd ec 6d 73 36 06 58 33 c8 be
0000010 53 fa 90 a2 a2 70 25 5f 67 1b c3 72 4f 26 c6 54
0000020 e9 83 44 c6 b9 45 3f 88 25 0c 4d c7 bc d5 77 58
0000030 d3 94 8e 4e e1 dd 71 02 dc c2 d0 19 f6 f4 5c 44
0000040 ff 84 56 9f 29 2a e5 00 33 d2 10 a4 d2 8a 13 56
0000050 d1 ac 86 46 4d 1e 2f 10 d9 0b 33 d7 c2 d4 ef df
0000060 d9 a2 0b 7f 24 05 72 39 2d a6 75 25 01 bd 41 6c
0000070 eb d9 4f 23 d9 ee 05 67 61 7c 8a 3d 5f 3a 76 e3
0000080

There ya go. That was faster than flipping a coin, eh? ( my Canadian bit just flipped )

So you were saying ( or someone somewhere had the crazy idea ) that ZFS with dedupe and compression enabled won't really be of great benefit because of all the binary files in the filesystem. Well, that's just nuts. Sorry, but it is. Those binary files are made up of ELF headers and opcodes from a specific set of opcodes for a given architecture, and that means the input set C consists of a "discrete set of possible values" and NOT pure random high entropy data.

Want a demo ? Here :

(1) take a nice big lib

$ uname -a
SunOS aequitas 5.11 snv_138 i86pc i386 i86pc
$ ls -lap /usr/lib | awk '{ print $5 " " $9 }' | sort -n | tail
4784548 libwx_gtk2u_core-2.8.so.0.6.0
4907156 libgtkmm-2.4.so.1.1.0
6403701 llib-lX11.ln
8939956 libicudata.so.2
9031420 libgs.so.8.64
9300228 libCg.so
9916268 libicudata.so.3
14046812 libicudata.so.40.1
21747700 libmlib.so.2
40736972 libwireshark.so.0.0.1

$ cp /usr/lib/libwireshark.so.0.0.1 /tmp
$ ls -l /tmp/libwireshark.so.0.0.1
-r-xr-xr-x 1 dclarke csw 40736972 May 7 14:20 /tmp/libwireshark.so.0.0.1

What is the SHA256 hash for that file ?

$ cd /tmp

Now compress it with gzip ( a good test case ) :

$ /opt/csw/bin/gzip -9v libwireshark.so.0.0.1
libwireshark.so.0.0.1: 76.1% -- replaced with libwireshark.so.0.0.1.gz

$ ls -l libwireshark.so.0.0.1.gz
-r-xr-xr-x 1 dclarke csw 9754053 May 7 14:20 libwireshark.so.0.0.1.gz

$ bc
scale=9
9754053/40736972
0.239439814

I see compression there.

Let's see what happens with really random data :

$ dd if=/dev/urandom of=/tmp/foo.dat bs=8192 count=8192
8192+0 records in
8192+0 records out
$ ls -l /tmp/foo.dat
-rw-r--r-- 1 dclarke csw 67108864 May 7 15:21 /tmp/foo.dat

$ ls -l /tmp/foo.dat.gz
-rw-r--r-- 1 dclarke csw 67119130 May 7 15:21 /tmp/foo.dat.gz

QED.
Dennis is correct, in that compressibility is inversely related to randomness. And, also, that binaries have nice commonality of symbols and headers. All of which goes to excellent DEDUP. But not necessarily real good compression - since what we need for compression is duplication /inside/ each file.

However, VM images aren't all binaries (typical OS images have text files and pre-compressed stuff all over the place, e.g. man pages, Windows .cab files, text config files, etc., plus the standard binary executables). So, let's see what a typical example brings us...

As a sample, I just compressed a basic installation of Windows 2000 (C:\WINNT). Here's what I found:

Using a standard LZ-style encoding tool (DEFLATE, i.e. that used in gzip, ZIP, PDF, etc.), I get about a 40% compression (i.e. ~1.6 ratio). 1.8GB -> 858MB

Using ZFS filesystem compression of the same data, I get a 1.58x compression ratio, 2.2GB -> 1.4GB

A ZFS filesystem with dedup on, using the same WINNT data, produces a 1.86x dedup factor. (2.15GB "used", for 1.17GB allocated)

Finally, a ZFS filesystem with compression & dedup turned on: compressratio = 1.66x, dedupratio = 1.77x, 1.27GB data stored, 741MB allocated (remember, this is for a 2.2GB data set, so about 3.0x total)

I also just did a quick rsync of my / partition on an Ubuntu 10.04 x64 system - I'm getting about a 1.5x compressratio for that stuff (or, less than 35% compression). 6.3GB -> 4.2GB

So, yes, I should have been more exact, in that system binaries /DO/ compress. However, they're not excessively compressible, and in a comparison between a fs with compression and one with dedup, dedup wins hands down, even if I'm not storing lots of identical images.

I'm still not convinced that compression is really worthwhile for OS images, but I'm open to talk... :-)

Also, here's the time output for each of the unzip operations (I'm doing this in a VirtualBox instance, so it's a bit slow):

(no dedup, no compression)
real    1m41.865s
user    0m23.270s
sys     0m26.400s

(no dedup, compression)
real    1m40.465s
user    0m23.088s
sys     0m25.190s

(no compression, dedup)
real    2m1.400s
user    0m23.162s
sys     0m27.639s

(dedup & compression)
real    1m51.122s
user    0m23.294s
sys     0m26.953s

I'll also be a bit more careful with the broad generalizations. :-)

-Erik

On Fri, 2010-05-07 at 11:23 -0400, Dennis Clarke wrote:
> > On 06/05/2010 21:07, Erik Trimble wrote:
> >> VM images contain large quantities of executable files, most of which
> >> compress poorly, if at all.
> >
> > What data are you basing that generalization on ?
>
> note : I can't believe someone said that.
>
> warning : I just detected a fast rise time on my pedantic input line and I
> am in full geek herd mode :
> http://www.blastwave.org/dclarke/blog/?q=node/160
> [...]
-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)