Folks

I'm looking for a solution for backups because ZFS has failed on me too many times. In my environment, I have a large amount of data (around 2 TB) that I periodically back up. I keep the last 5 "snapshots". I use rsync so that when I overwrite the oldest backup, most of the data is already there and the backup completes quickly, because only a small number of files have actually changed.

Because of this low change rate, I have used ZFS with its deduplication feature to store the data. I started with a CentOS 6 installation and upgraded years ago to CentOS 7; CentOS 8 is on my agenda. However, I've had several data-loss events with ZFS where, because of a combination of errors and/or mistakes, the entire store was lost. I've also noticed that ZFS is maintained separately from CentOS, and at this moment the CentOS 8 update causes ZFS to fail. Looking for an alternative, I'm trying VDO.

In the VDO installation, I created a logical volume containing two hard drives and defined VDO on top of that logical volume. It appears to be running, yet the deduplication numbers don't pass the smell test. I would expect that if the logical volume contains three copies of essentially identical data, I should see a deduplication ratio close to 3.00, but instead I'm seeing numbers like 1.15. I compute the ratio as follows:
  - use df and extract the "used" value (in 1K blocks) from the third column
  - use vdostats --verbose and extract the number titled "1K-blocks used"
  - divide the first by the second.

Can you provide any advice on my use of ZFS or VDO without telling me that I should be doing backups differently?

Thanks

David
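For concreteness, here is that computation as a small shell sketch (the device and mount point names are placeholders, not David's actual setup):

   DEV=/dev/mapper/vdoback   # placeholder VDO device
   MNT=/backup               # placeholder mount point
   fs_used=$(df -k "$MNT" | awk 'NR==2 {print $3}')   # filesystem "used" in 1K blocks
   vdo_used=$(vdostats --verbose "$DEV" | awk -F: '/1K-blocks used/ {gsub(/ /,"",$2); print $2}')
   echo "scale=2; $fs_used / $vdo_used" | bc          # the dedupe ratio being reported

Note that vdostats' "1K-blocks used" includes the UDS index and VDO metadata, so a lightly filled volume skews this ratio low - a point that comes up later in the thread.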
Erick Perez - Quadrian Enterprises
2020-May-03 03:07 UTC
[CentOS] Understanding VDO vs ZFS
My two cents:
1- Do you have an encrypted filesystem on top of VDO? If yes, you will see no benefit from dedupe.
2- Can you post the stats of vdostats --verbose /dev/mapper/xxxxx (replace with your device)?

You can do something like:

   vdostats --verbose /dev/mapper/xxxxxxxx | grep -B6 'saving percent'

On Sat, May 2, 2020 at 9:54 PM david <david at daku.org> wrote:
> Folks
> I'm looking for a solution for backups because ZFS has failed on me
> too many times. <snip>

--
Erick Perez
Erick Perez - Quadrian Enterprises
2020-May-03 05:33 UTC
[CentOS] Understanding VDO vs ZFS
To follow up, I ran a quick test. I created a 40GB LVM volume group from /dev/sdb and /dev/sdc, then a 40GB LV, then a 60GB VDO volume on top of it (for testing purposes), checking the stats after each step with:

   vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent'

Output on the just-created vdoas:

   [root at localhost ~]# vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent'
     physical blocks     : 10483712
     logical blocks      : 15728640
     1K-blocks           : 41934848
     1K-blocks used      : 4212024
     1K-blocks available : 37722824
     used percent        : 10
     saving percent      : 99
   [root at localhost ~]#

FIRST copy of CentOS-7-x86_64-Minimal-2003.iso (1.1G) to vdoas, from a source outside the VDO volume:

     1K-blocks used      : 4721348
     1K-blocks available : 37213500
     used percent        : 11
     saving percent      : 9

SECOND copy of the same ISO, again from a source outside the VDO volume
(# cp /root/CentOS-7-x86_64-Minimal-2003.iso /mnt/vdomounts/CentOS-7-x86_64-Minimal-2003-version2.iso):

     1K-blocks used      : 5239012
     1K-blocks available : 36695836
     used percent        : 12
     saving percent      : 52

THIRD copy of the same ISO, this time from inside the VDO volume to inside the VDO volume:

     1K-blocks used      : 5248060
     1K-blocks available : 36686788
     used percent        : 12
     saving percent      : 67

Then I did this a total of 9 more times, to have 10 ISOs copied. Total data copied: 10.6GB.

Do note: df shows the VDO logical size, in my case 60G; vdostats shows the size of the underlying LV, in my case 40G. Remember, dedupe AND compression are both enabled. The df -hT output shows the logical space occupied by these ISO files as seen by the filesystem on the VDO volume: since VDO manages a logical-to-physical block map, df sees logical space consumed according to the filesystem that resides on top of the VDO volume. vdostats --hu, by contrast, views the physical block device as managed by VDO. Physically, a single ISO image resides on the disk, but logically the filesystem thinks there are 10 copies occupying 10.6GB.

So at the end I have 10 ISOs of 1086 1MB blocks each (10860 1MB blocks in total), which yield:

     1K-blocks used      : 5248212
     1K-blocks available : 36686636
     used percent        : 12
     saving percent      : 89

At the end it is using 5248212 1K-blocks; subtracting the 4212024 1K-blocks used initially gives (5248212 - 4212024) = 1036188 1K-blocks / 1024 = about 1012MB total.
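In shell terms, netting out the volume's fixed overhead (the constants are taken from the vdostats output above):

   initial=4212024   # 1K-blocks used right after 'vdo create' (UDS index + VDO metadata)
   now=5248212       # 1K-blocks used after all ten ISOs were copied
   echo $(( (now - initial) / 1024 ))   # -> 1011, i.e. ~1GB physical for ~10.6GB logical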
with only "yum install vdo kmod-kvdo" History of commands: [root at localhost vdomounts]# history 2 pvcreate /dev/sdb 3 pvcreate /dev/sdc 8 vgcreate -v -A y vgvol01 /dev/sdb /dev/sdc 9 vgdisplay 13 lvcreate -l 100%FREE -n lvvdo01 vgvol01 14 yum install vdo kmod-kvdo 18 vdo create --name=vdoas --device=/dev/vgvol01/lvvdo01 --vdoLogicalSize=60G --writePolicy=async 19 mkfs.xfs -K /dev/mapper/vdoas 20 ls /mnt 21 mkdir /mnt/vdomounts 22 mount /dev/mapper/vdoas /mnt//vdomounts/ 26 vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent' 28 cp /root/CentOS-7-x86_64-Minimal-2003.iso /mnt/vdomounts/ -vvv 29 vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent' 30 cp /root/CentOS-7-x86_64-Minimal-2003.iso /mnt/vdomounts/CentOS-7-x86_64-Minimal-2003-version2.iso 31 vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent' 33 cd /mnt/vdomounts/ 35 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version3.iso 36 vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent' 37 df 39 vdostats --hu 40 ls -l --block-size=1MB /root/CentOS-7-x86_64-Minimal-2003.iso 41 df -hT 42 vdo status | grep Dedupl 43 vdostats --hu 44 vdostats 48 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version4.iso 49 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version5.iso 50 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version6.iso 51 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version7.iso 52 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version8.iso 53 cp CentOS-7-x86_64-Minimal-2003-version2.iso ./CentOS-7-x86_64-Minimal-2003-version9.iso 54 df -hT 55 ls -l --block-size=1MB 56 vdostats --hu 57 df -hT 58 df 59 vdostats --hu 60 vdostats 61 vdostats --verbose /dev/mapper/vdoas | grep -B6 'saving percent' 62 cat /etc/centos-release 63 history [root at localhost vdomounts]# On Sat, May 2, 2020 at 10:07 PM Erick Perez - Quadrian Enterprises < eperez at quadrianweb.com> wrote:> My two cents: > 1- Do you have an encrypted filesystem on top of VDO? If yes, you will see > no benefit from dedupe. > 2- can you post the stats of vdostats ?verbose /dev/mapper/xxxxx (replace > with your device) > > you can do something like: "vdostats -verbose /dev/mapper/xxxxxxxx | grep > -B6 'save percentage' > > > > > On Sat, May 2, 2020 at 9:54 PM david <david at daku.org> wrote: > >> Folks >> >> I'm looking for a solution for backups because ZFS has failed on me >> too many times. In my environment, I have a large amount of data >> (around 2tb) that I periodically back up. I keep the last 5 >> "snapshots". I use rsync so that when I overwrite the oldest backup, >> most of the data is already there and the backup completes quickly, >> because only a small number of files have actually changed. >> >> Because of this low change rate, I have used ZFS with its >> deduplication feature to store the data. I started using a Centos-6 >> installation, and upgraded years ago to Centos7. Centos 8 is on my >> agenda. However, I've had several data-loss events with ZFS where >> because of a combination of errors and/or mistakes, the entire store >> was lost. I've also noticed that ZFS is maintained separately from >> Centos. At this moment, the Centos 8 update causes ZFS to >> fail. Looking for an alternate, I'm trying VDO. >> >> In the VDO installation, I created a logical volume containing two >> hard-drives, and defined VDO on top of that logical volume. 
--
Erick Perez
Quadrian Enterprises S.A. - Panama, Republica de Panama
On 03/05/20 04:50, david wrote:
> I'm looking for a solution for backups because ZFS has failed on me
> too many times. [...] I use rsync so that when I overwrite the oldest
> backup, most of the data is already there and the backup completes
> quickly, because only a small number of files have actually changed.
> <snip>

Hi David,
I'm not an expert on VDO, but I'm going to try it for backup purposes myself, with rsync + hardlinks. I know this is not the answer you asked for, sorry about that. Several users have advised me to use a more specialized deduplicating backup tool (borg, in my case); I'm testing it, and I'm not sure yet whether I will adopt it in the long term. Since you are already using an rsync-based solution, I would ask: why not use a more specialized tool? What are the benefits, for you, of staying with rsync? Thank you in advance.
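For reference, a minimal sketch of the rsync + hardlink rotation idea (the paths and the retention count of 5 are illustrative; unchanged files are hardlinked against the previous snapshot, so they take no extra space):

   rm -rf /backup/snap.4                         # drop the oldest snapshot
   for i in 3 2 1 0; do                          # shift the rest down
       [ -d /backup/snap.$i ] && mv /backup/snap.$i /backup/snap.$((i+1))
   done
   rsync -a --delete --link-dest=/backup/snap.1 /srv/data/ /backup/snap.0/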
At 08:07 PM 5/2/2020, you wrote:
> My two cents:
> 1- Do you have an encrypted filesystem on top of VDO? If yes, you will
> see no benefit from dedupe.
> 2- can you post the stats of vdostats --verbose /dev/mapper/xxxxx
> (replace with your device)
> <snip>

BTW: I think the 'saving percent' of 13 is consistent with my computation of 1.16, if one takes into account the overhead blocks. Is that true?

[Attachment: vdostats.txt -- http://lists.centos.org/pipermail/centos/attachments/20200503/6f1b0b4a/attachment.txt]
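As a back-of-the-envelope check (assuming 'saving percent' counts savings against logical blocks used, before index/metadata overhead), the two figures are two views of the same quantity:

   # a 13% saving means physical ≈ 0.87 x logical, so:
   echo "scale=4; 1 / (1 - 0.13)" | bc   # -> 1.1494, i.e. ~1.15, in the same ballpark as 1.16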
Hi David,

In my opinion, VDO isn't worth the effort. I tried VDO for the same use case: backups. My dataset is 2-3TB and I back up daily. Even with a smaller dataset, VDO couldn't live up to its promises. It used tons of CPU and memory, and with a lot of tuning I could get it to kind of work, but it became corrupted at the slightest problem (even a shutdown could do this, and shutdowns could also take hours).

I have tried a number of things, and I now use a combination of two (see the sketch after this message):
1. a btrfs volume with force-compress enabled to store the intermediate data - it compresses my data to about 60% and that's enough for me
2. bup (https://bup.github.io/) to store long-term backups.

bup is incredibly efficient for my use case (full VM backups). Over the course of a whole month, the dataset only increases by about 30% from the initial size (I create a new full backup each month) - and this is with FULL backups of all VMs every day. bup backup sets can also be mounted via FUSE, giving you access to all stored versions in a filesystem-like manner.

If you can back up at will, you can probably forgo the btrfs volume for intermediate storage - that is just a band-aid to work around a specific issue here.

Stefan
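A minimal sketch of such a setup (the repository path, source directory, device name, and compression algorithm are illustrative, not Stefan's actual configuration):

   # btrfs staging volume with forced compression
   mount -o compress-force=zlib /dev/sdX /mnt/staging

   # bup repository for the long-term backups
   export BUP_DIR=/backup/bup
   bup init                                # create the repository once
   bup index /srv/vm-images                # scan for changed files
   bup save -n vm-images /srv/vm-images    # store a deduplicated "full" backup
   bup fuse /mnt/bup-view                  # browse every saved version via FUSE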
On Sat, May 2, 2020 at 10:54 PM david <david at daku.org> wrote:
> I'm looking for a solution for backups because ZFS has failed on me
> too many times. <snip>
> I would expect that if the logical volume contains three copies of
> essentially identical data, I should see deduplication numbers close
> to 3.00, but instead I'm seeing numbers like 1.15.

I'd like to know what kind of data you're looking to back up (that will just help give an idea of whether it's even a good fit for dedupe; though if it dedupes well on ZFS, it probably is fine). I'd also like to know how you configured your VDO volume (please provide the 'vdo create' command you used). As mentioned in some other responses, can you provide vdostats (full 'vdostats --verbose' output as well as base 'vdostats') and df outputs for this volume? That would help in understanding a bit more of what you're experiencing.

The default deduplication window for a VDO volume is ~250G (--indexMem=0.25). Assuming you're writing the full 2T of data each time and want to achieve deduplication across that entire 2T, it would require an "--indexMem=2" configuration. You may want to account for growth as well, which means considering an even larger value for '--indexMem'. Alternatively, if memory isn't plentiful, you could enable the sparse-index option to cover a significantly larger dedupe window for a smaller memory commitment; an additional on-disk footprint goes with it, and the documentation [0] lists the specific requirements. For this setup, a sparse index with the default memory footprint (0.25G) would cover ~2.5T, but would require an additional ~20G of storage over the default index configuration.

[0] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/deduplicating_and_compressing_storage/deploying-vdo_deduplicating-and-compressing-storage#vdo-memory-requirements_vdo-requirements
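As a hedged sketch, the two index configurations described above might look like this (the device path follows the test earlier in the thread; the volume name and logical size are illustrative):

   # dense index sized for ~2T of unique data
   vdo create --name=vdoback --device=/dev/vgvol01/lvvdo01 \
       --indexMem=2 --vdoLogicalSize=10T

   # or: sparse index, ~2.5T dedupe window with the default 0.25G of RAM,
   # at the cost of roughly 20G of extra on-disk index space
   vdo create --name=vdoback --device=/dev/vgvol01/lvvdo01 \
       --indexMem=0.25 --sparseIndex=enabled --vdoLogicalSize=10T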
> Can you provide any advice on my use of ZFS or VDO without telling me
> that I should be doing backups differently?

Without more information about what you're attempting to do, I can't really say that what you're doing is wrong, but I also can't say that there are any expectations of VDO that aren't being met. More context would certainly help get to the bottom of this question.
On Mon, May 4, 2020 at 10:02 AM Stefan S <stefan at kalaam.org> wrote:
> in my opinion, VDO isn't worth the effort. I tried VDO for the same
> use case: backups. [...] it became corrupted at the slightest problem
> (even a shutdown could do this, and shutdowns could also take hours).
> <snip>

I'm sorry to hear you feel that way. I would be interested to understand the situations in which you experienced these problems, so that they can be addressed better in the future. Did you reach out for any guidance when it was happening?
On Sat, May 2, 2020 at 10:54 PM david <david at daku.org> wrote:
> I'm looking for a solution for backups because ZFS has failed on me
> too many times. In my environment, I have a large amount of data
> (around 2tb) that I periodically back up. I keep the last 5
> "snapshots". <snip>

Duplicity works well on CentOS. I had to perform a restore of a website and wiki after I [accidentally] deleted both. Backups go to another machine over SSH, scheduled through systemd. A Duplicity-based backup may help protect your data until you get something in place that you like better.

Jeff
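A minimal sketch of such a setup (the host name and paths are illustrative; duplicity encrypts with GnuPG by default, so a passphrase or key is needed):

   # incremental, encrypted backup to another machine over SSH
   duplicity /srv/data sftp://backup@backuphost//srv/backups/data

   # restore the most recent backup
   duplicity restore sftp://backup@backuphost//srv/backups/data /srv/restore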
Rather than dedupe at the filesystem level, I found the application-level dedupe in BackupPC works really well. I've run BackupPC on both a big ZFS volume and on a giant XFS-over-LVM-over-MDRAID volume (24 x 3TB disks organized as 2 x 11 RAID6 plus 2 hot spares).

The BackupPC server I built at my last $job had 30 days of daily incrementals and 12 months of monthlies of about 25 servers + VMs (including Linux, Solaris, AIX, and Windows). The dedupe is done globally at the file level, so no matter how many instances of a file exist across all those backups ((30+12) * 25), there's only one file in the 'hive'.

Bonus: BackupPC has a nice web UI for retrieving backups. I could create accounts for my various developers, and they could retrieve stuff from any covered date on any of the servers they had access to, without my intervention. About the only manual intervention I ever needed over the several years this was running involved the Windows rsync client needing a PID file deleted after an unexpected reboot.

--
-john r pierce
 recycling used bits in santa cruz
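For reference, the retention John describes might look like this in BackupPC's config.pl (the option names are BackupPC's; the values are a sketch matching the description above, not his actual configuration):

   $Conf{XferMethod}  = 'rsync';
   $Conf{IncrPeriod}  = 0.97;    # run incrementals roughly daily
   $Conf{IncrKeepCnt} = 30;      # keep 30 daily incrementals
   $Conf{FullPeriod}  = 29.97;   # run fulls roughly monthly
   $Conf{FullKeepCnt} = 12;      # keep 12 monthly fulls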