On Sun, Oct 13, 2013 at 11:54:42PM -0300, Rogério Brito wrote:
> Hi.
>
> I am seriously considering employing btrfs on my systems, particularly due
> to some space-saving features that it has (namely, deduplication and
> compression).
>
> In fact, I was (a few moments ago) trying to back up some of my systems to a
> 2TB HD that has an ext4 filesystem and, in the middle of the last one, I got
> the error message that the backup HD was full.
>
> Given that what I backup there are systems where I have some of the data
> present multiple times (e.g., my mailbox that is synced via offlineimap, or
> videos that I download from online learning sites) and that such data
> consists of many small files that are highly compressible (the e-mails) or
> large files (the videos), I would like to employ btrfs.
>
> So, after reading the documentation on https://btrfs.wiki.kernel.org/, I am
> still unsure of some points and I would like to have some clarifications
> and/or expectations set straight.
>
>
> * I understand that I can convert an ext4 filesystem to btrfs. Will such
> conversion work with an almost full ext4 filesystem? How much overhead
> will be needed to perform the conversion? I can (temporarily) remove some
> files that already are on this backup.
   I don't think we've ever explored the bounds of exactly how much
space you need for conversion. It'll be an absolute minimum of 0.1% of
the data used, probably quite a bit more, for the metadata.
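If you do try it, note that btrfs-convert keeps the original ext4 metadata around, so you can roll back until you delete it. A rough sketch (the device name is just an example):

```shell
# Convert an unmounted ext4 filesystem in place (device is an example)
umount /dev/sdb1
btrfs-convert /dev/sdb1

# The original ext4 image is preserved in the ext2_saved subvolume;
# deleting it reclaims that space but makes rollback impossible.
mount /dev/sdb1 /mnt
btrfs subvolume delete /mnt/ext2_saved

# Alternatively, before deleting ext2_saved, roll back to ext4:
# btrfs-convert -r /dev/sdb1
```

The extra space needed during conversion is mostly the new btrfs metadata, since the data blocks themselves are referenced in place rather than copied.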
> * Is it possible to deduplicate the files that are already in it? As
> mentioned before, there are likely to be many, and some of them are on the
> order of 1 to 2GBs.
   Yes, there's an out-of-band deduplicator. I'll have to go and look
it up to work out exactly what tools you need to make it work. :)
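The bedup tool you mention below is one option. A hedged sketch of a typical invocation (assuming bedup is installed and the volume is mounted; check the bedup README for the exact subcommands):

```shell
# bedup keeps a database of file hashes, so repeated runs only need
# to rescan files that changed since the last run (run as root)
bedup scan /mnt/backup    # index the files on the volume
bedup dedup /mnt/backup   # merge the duplicate extents it found
```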
> * Doing a defragmentation with the filesystem mounted with compression will
> recompress the files (if they are deemed compressible by the
> filesystem). Is that understanding correct? Will compressed blocks among
> many files also be deduplicated?
   You'll probably need to add -c to the defrag command, but yes, you
can persuade the FS to recompress files. I'm not sure how this affects
deduplication.
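Something along these lines should do it (the mount point is an example):

```shell
# Recursively defragment and recompress everything under the mount
# point; -czlib forces zlib compression (lzo is the other option)
btrfs filesystem defragment -r -czlib /mnt/backup
```

Be aware that defragmenting a file can break its sharing with snapshots or reflinked copies, so you may want to run any dedup pass after the defrag, not before.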
> * How exactly do the recently merged offline deduplication features in the
> kernel interfere with what was (in my limited understanding) already
> possible with userspace tools like <https://github.com/g2p/bedup>? Are
> such third-party tools likely to be integrated into btrfs-progs? Are they
> supposed to be kept separate?
The out-of-band (rather than offline) dedup kernel features simply
give a more reliable API call for merging identical extents, as it
allows them to be locked during the process -- without that API call,
there's a race condition that could potentially lead to data loss.
> * Does this change the on-disk format? Putting it another way, will it be
> safe to possibly go back to a previous kernel, if there is some problem
> with the current kernels? (Not that I necessarily want to go back to a
> previous kernel, but, sometimes, one would need to, say, git bisect the
> kernel).
   No, that feature doesn't change the on-disk format.
> * I most likely *don't* want to use online deduplication (given my bad
> experiences with ZFS). With that in mind, is the current userspace
> deduplication intended to be run as a cron job? Is the offline
> deduplication too memory-intensive? How much RAM would be needed for a
> 2TB filesystem? Are 2GB enough? How about 4GB?
   Out-of-band dedup is indeed the kind of thing you'd run as a cron
job. However, there are a couple of better approaches you can use. I
don't know about RAM usage, I'm afraid.
   If you use rsync for backups, then you can keep one subvolume as
the "current" version of the backups, and use the --inplace option of
rsync. Then, immediately after finishing a backup run, you can
snapshot that subvolume to give yourself a read-only historical
record. This ensures that the maximum quantity of data is shared
between the individual backups without having to use OOB dedup.
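A sketch of one backup run (paths and snapshot naming are examples):

```shell
# Update the "current" subvolume in place, then snapshot it read-only
# under a dated name. --no-whole-file matters for local transfers:
# without it rsync rewrites whole files locally, which would defeat
# the block-level sharing that --inplace is meant to preserve.
rsync -a --inplace --no-whole-file --delete /home/ /mnt/backup/current/
btrfs subvolume snapshot -r /mnt/backup/current \
    "/mnt/backup/$(date +%Y-%m-%d)"
```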
If your source FS is btrfs as well, you can do pretty much the same
thing (it's a little more complicated to set up) with btrfs
send/receive, which uses the inherent knowledge of the FS to work out
the differences more efficiently than rsync.
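The send/receive version looks roughly like this (snapshot names are examples; the "previous" snapshot must already exist on both sides for an incremental send):

```shell
# Take a new read-only snapshot of the source, then send only the
# difference against the previous snapshot to the backup filesystem
btrfs subvolume snapshot -r /home /home/snap-new
btrfs send -p /home/snap-prev /home/snap-new | btrfs receive /mnt/backup/
```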
> * Will further runs of the offline deduplication be "incremental" in some
> imprecise sense of the word? That is, if I run the deduplication once and
> immediately run it again (supposing nothing changes), will the 2nd time be
> faster than the first? (If the disk caches are dropped?)
   I don't know, but probably (since it should be able to tell that
the extents are already CoW copies).
> * Will I be able to add further HDs to my btrfs filesystem, once I get some
> more money to run something like a RAID0 configuration? If I get more HDs
> later, will I be able to change the configuration to, say, RAID5 or RAID6?
> I don't intend to use lvm, unless I have to.
Yes, you can change RAID levels on the fly, while the FS is mounted.
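For example (device names and levels are illustrative; bear in mind that the RAID5/6 code is still quite new at this point, so treat it with caution):

```shell
# Add a second device, then rebalance the existing data across both
# devices as RAID0 (RAID1 for metadata is a common choice)
btrfs device add /dev/sdc /mnt/backup
btrfs balance start -dconvert=raid0 -mconvert=raid1 /mnt/backup

# Later, with more disks attached, convert again, e.g.:
# btrfs balance start -dconvert=raid5 /mnt/backup
```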
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
       --- I can resist everything except temptation ---