Nothing too interesting here, I was just playing around with the idea
of a deduplication allocator for nbdkit (“allocator=dedup”, see
https://rwmj.wordpress.com/2020/06/15/compressed-ram-disks/).
Before implementing such a thing I wanted to know if there's much
duplicated structure in a disk image.  The amount of duplication
depends critically on the block size, but even in the best case
there are no significant savings to be had by deduplicating.
A test script I wrote is attached.  It takes any disk image that you
give it plus a block size, and produces a report on the potential
number of duplicated blocks at the given block size.  For example:
  $ guestfish -N fs:vfat -m /dev/sda1 tgz-in /tmp/nbdkit-1.21.10.tar.gz /
  $ ./dedup.pl test1.img 4096
  disk size = 1073741824
  
  blocks of zeros:                       259699 / 262144	99.1%
  blocks that could not be deduplicated: 2336 / 262144	0.9%
  
  number of duplicates: 22 count: 1      0.0%
  number of duplicates: 11 count: 2      0.0%
  number of duplicates: 9	 count: 1      0.0%
  number of duplicates: 8	 count: 1      0.0%
  number of duplicates: 6	 count: 1      0.0%
  number of duplicates: 3	 count: 2      0.0%
  number of duplicates: 2	 count: 18     0.0%
This means that > 99% of the disk is zeroes.  0.9% of the disk (2336 x
4k blocks) is unique and cannot be deduplicated.
For the remaining 4k blocks: 1 block had 22 copies, 2 blocks had 11
copies each, 1 block had 9 copies, and so on down to 18 blocks which
had 2 copies each.
** But almost no savings could be made by deduplicating **
   100% - 99.1% - 0.9% ≈ 0
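To put a number on that, the histogram in the report above can be
turned into a count of blocks that deduplication would actually free.
This is only a minimal sketch: the %hist literal is copied by hand from
the 4096 byte vfat report, and the savings helper is a hypothetical
name for illustration, not part of dedup.pl.

```perl
#!/usr/bin/perl -w
use strict;

# Copied by hand from the 4096 byte vfat report:
# number of copies => how many distinct blocks have that many copies.
my %hist = (22 => 1, 11 => 2, 9 => 1, 8 => 1, 6 => 1, 3 => 2, 2 => 18);
my $blocks = 262144;    # total 4k blocks in the 1G image

# Hypothetical helper: returns (blocks occupied by duplicated data,
# distinct blocks needed after deduplication, blocks saved).
sub savings {
    my ($hist) = @_;
    my ($stored, $distinct) = (0, 0);
    foreach my $n (keys %$hist) {
        $stored   += $n * $hist->{$n};
        $distinct += $hist->{$n};
    }
    return ($stored, $distinct, $stored - $distinct);
}

my ($stored, $distinct, $saved) = savings (\%hist);
printf ("duplicated blocks stored: %d\n", $stored);
printf ("distinct blocks needed:   %d\n", $distinct);
printf ("blocks saved:             %d / %d\t%.2f%%\n",
        $saved, $blocks, 100 * $saved / $blocks);
```

With these figures the duplicated data occupies 109 blocks, could be
stored in 26, and so saves 83 blocks out of 262144 (0.03%).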
Adjusting the block size (and of course trying different disk images,
varying the filesystem type and the data stored) showed that we'd need
to use the smallest possible block size (i.e. 512 bytes) in order to
get the best deduplication.  Same as above with a 512 byte block size:
  $ ./dedup.pl test1.img 512
  disk size = 1073741824
  
  blocks of zeros:                       2080973 / 2097152	99.2%
  blocks that could not be deduplicated: 12952 / 2097152	0.6%
  
  number of duplicates: 160	count: 1	0.0%
  number of duplicates: 115	count: 1	0.0%
  number of duplicates: 89	count: 1	0.0%
  number of duplicates: 84	count: 1	0.0%
  number of duplicates: 74	count: 1	0.0%
  number of duplicates: 73	count: 1	0.0%
...
  number of duplicates: 5	count: 7	0.0%
  number of duplicates: 4	count: 37	0.0%
  number of duplicates: 3	count: 73	0.0%
  number of duplicates: 2	count: 332	0.0%
Note that even with the smaller block size, almost nothing would be
saved by deduplicating (only ≈0.2% of the disk).  Smaller block sizes
have quite a lot more overhead too, so there's a trade-off between
block size and metadata.
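As a rough illustration of that trade-off, here is a sketch of how the
size of a per-block index grows as the block size shrinks.  It rests on
assumptions that are not from nbdkit: a flat index with one entry for
every block on the disk, and 24 bytes per entry (a 16 byte digest plus
an 8 byte pointer).  A real allocator might index only allocated blocks
and use a different entry size.  index_bytes is a hypothetical helper.

```perl
#!/usr/bin/perl -w
use strict;

my $disk_size   = 1073741824;   # 1G test image used in the reports
my $entry_bytes = 24;           # ASSUMPTION: 16 byte digest + 8 byte pointer

# Hypothetical helper: bytes needed for a flat index holding one
# entry per block of the disk at the given block size.
sub index_bytes {
    my ($size, $bs, $entry) = @_;
    return ($size / $bs) * $entry;
}

foreach my $bs (512, 4096) {
    my $meta = index_bytes ($disk_size, $bs, $entry_bytes);
    printf ("block size %4d: %8d entries, index = %d bytes (%.1f%% of disk)\n",
            $bs, $disk_size / $bs, $meta, 100 * $meta / $disk_size);
}
```

Under those assumptions the index is about 0.6% of the disk at 4096
bytes and about 4.7% at 512 bytes, which is far more than the ≈0.2%
that deduplication could save.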
Changing the filesystem type makes some difference (but not enough to
be important).  Here's the same data as the first table above, but
using ext4 instead of vfat:
  $ guestfish -N fs:ext4 -m /dev/sda1 tgz-in /tmp/nbdkit-1.21.10.tar.gz /
  $ /var/tmp/dedup.pl test1.img 4096 
  disk size = 1073741824
  
  blocks of zeros:                       259317 / 262144	98.9%
  blocks that could not be deduplicated: 2359 / 262144	0.9%
  
  number of duplicates: 22 count: 1      0.0%
  number of duplicates: 11 count: 2      0.0%
  number of duplicates: 9	 count: 1      0.0%
  number of duplicates: 8	 count: 1      0.0%
  number of duplicates: 6	 count: 1      0.0%
  number of duplicates: 4	 count: 1      0.0%
  number of duplicates: 3	 count: 3      0.0%
  number of duplicates: 2	 count: 194    0.1%
Here's another one where I take a video file which is about 11% of the
virtual size of the disk (and of course compressed, so effectively
random and pretty much the worst case for deduplication):
  $ guestfish -N fs:ext4 -m /dev/sda1 upload /tmp/2020-02-rjones-goals-tech-talk.mp4 /video.mp4
  $ /var/tmp/dedup.pl test1.img 4096 
  disk size = 1073741824
  
  blocks of zeros:                       236161 / 262144	90.1%
  blocks that could not be deduplicated: 25966 / 262144	9.9%
  
  number of duplicates: 4	 count: 1      0.0%
  number of duplicates: 3	 count: 1      0.0%
  number of duplicates: 2	 count: 5      0.0%
Rich.
-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Attachment: dedup.pl

#!/usr/bin/perl -w

use strict;

use Digest::MD5 qw(md5_hex);

die "$0 disk blocksize\n" unless @ARGV == 2;

my $bs = $ARGV[1];

my %h;

my $blocks = 0;
my $zeroes = 0;
open DISK, $ARGV[0] or die "open: $!";
my $size = (stat ($ARGV[0]))[7];
print "disk size = $size\n";
print "\n";

for (my $i = 0; $i < $size; $i += $bs) {
    my $buf;
    read DISK, $buf, $bs;
    die (length $buf, " != ", $bs) unless length $buf == $bs;
    if ($buf eq ("\0" x $bs)) {
        $zeroes++;
    }
    else {
        my $digest = md5_hex ($buf);
        $h{$digest} = 0 unless exists $h{$digest};
        $h{$digest}++;
    }
    $blocks++;
}

my %hist;
my $lonely = 0;
foreach (keys %h) {
    my $n = $h{$_};
    if ($n > 1) {
        $hist{$n} = 0 unless exists $hist{$n};
        $hist{$n}++;
    } else {
        $lonely++;
    }
}

printf ("blocks of zeros:                       %s / %s\t%.1f%%\n", $zeroes, $blocks, 100 * $zeroes / $blocks);
printf ("blocks that could not be deduplicated: %s / %s\t%.1f%%\n", $lonely, $blocks, 100 * $lonely / $blocks);
print "\n";
foreach (sort { $b <=> $a }  keys %hist) {
    printf ("number of duplicates: %d\tcount: %d\t%.1f%%\n", $_, $hist{$_}, 100 * $hist{$_} / $blocks);
}