thr3ads.net - Btrfs devel - Q: Why subvolumes? [Jul 2013]

Jerome Haltom

2013-Jul-23 11:59 UTC

Q: Why subvolumes?

May I ask why the decision to implement snapshotting through
subvolumes? I've been very curious about why the design wasn't to
simply allow snapshotting of any directory or file.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Buschka

2013-Jul-23 14:52 UTC

Re: Why subvolumes?

Hi Jerome,

essentially, a btrfs subvolume is the root of a btrfs filesystem (you can take
it and mount it as it is). This is critical for the snapshot functionality: if
you have a subvolume (consisting of a snapshot) for, say, "/", and your system
goes south (e.g. after updating the kernel or another crucial system package),
then all you have to do is tell the Linux kernel at the bootloader prompt (via
the rootflags=... parameter) not to mount the default subvolume, but the
snapshot. Then you can boot the "last known good" state of the system normally
and recover from the comfort of a running system.

The main point here is that the default btrfs subvolume (which you would
normally mount as /) is technically no different from any other subvolume.

Best regards,
Andreas

-----Original Message-----
From: linux-btrfs-owner@vger.kernel.org
[mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Jerome Haltom
Sent: Tuesday, 23 July 2013 14:00
To: Linux Btrfs
Subject: Q: Why subvolumes?

May I ask why the decision to implement snapshotting through subvolumes?
I've been very curious about why the design wasn't to simply
allow snapshotting of any directory or file.

Hugo Mills

2013-Jul-23 15:06 UTC

Re: Q: Why subvolumes?

On Tue, Jul 23, 2013 at 06:59:35AM -0500, Jerome Haltom wrote:
> May I ask why the decision to implement snapshotting through
> subvolumes? I've been very curious about why the design wasn't to
> simply allow snapshotting of any directory or file.

   tl;dr: It just doesn't work that way, and it's hard to do so within
the bounds of snapshots being atomic.

   It's down to the way that snapshots are implemented (btrfs being a
copy-on-write filesystem). A snapshot is an (atomic) copy of the FS
tree for a subvolume, where the FS tree is the metadata tree which
holds the inode information, filenames, directory structure,
permissions and so forth. Being a CoW FS, we can do this easily and
trivially by copying only the root block of the tree -- a matter of a
few KiB. Running ls -R on a snapshot and its original will read
exactly the same blocks on the disk, except for the single top-level
block in each case. As the snapshot is modified, the metadata changes,
and parts of the FS tree for the snapshot are CoWed, leaving the
original blocks in place. There is a reference-counting mechanism here
as well, to ensure that we don't leave unused blocks lying around the
place.
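
A toy model of the root-copy idea described above may help. This is only a
sketch in Python, not btrfs code: Node, snapshot and cow_set are made-up
names, each Node stands in for one tree block, and Python's garbage collector
plays the part of the reference counting mentioned above.

```python
class Node:
    """One tree block: maps names to child Nodes or to leaf values."""

    def __init__(self, entries):
        # Copy the mapping so each block owns its own table of references.
        self.entries = dict(entries)


def snapshot(root):
    """Create a snapshot by copying only the root block.

    Every child block is shared by reference with the original tree.
    """
    return Node(root.entries)


def cow_set(root, path, value):
    """Return a new root in which `path` holds `value`.

    Only the blocks along `path` are copied; everything else stays shared,
    and the tree reachable from the old root is left untouched.
    """
    key, rest = path[0], path[1:]
    new_entries = dict(root.entries)
    if rest:
        child = root.entries.get(key, Node({}))
        new_entries[key] = cow_set(child, rest, value)
    else:
        new_entries[key] = value
    return Node(new_entries)


# Build a small tree, snapshot it, then modify the snapshot.
original = Node({
    "etc": Node({"fstab": "v1"}),
    "home": Node({"user": Node({"notes": "hello"})}),
})
snap = snapshot(original)
modified_snap = cow_set(snap, ["etc", "fstab"], "v2")
```

After the modification, original still sees the old "etc" block, while
modified_snap has a new one; the "home" block is the same object in both.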

   Now... since the snapshot's FS tree is a direct duplicate of the
original FS tree (actually, it's the same tree, but they look like
different things to the outside world), they share everything --
including things like inode numbers. This is OK within a subvolume,
because we have the semantics that subvolumes have their own distinct
inode-number spaces. If we could snapshot arbitrary subsections of the
FS, we'd end up having to fix up inode numbers to ensure that they
were unique -- which can't really be an atomic operation (unless you
want to have the FS locked while the kernel updates the inodes of the
billion files you just snapshotted).

   The other thing to talk about here is that while the FS tree is a
tree structure, it's not a direct one-to-one map to the directory tree
structure. In fact, it looks more like a list of inodes, in inode
order, with some extra info for easily tracking through the list. The
B-tree structure of the FS tree is just a fast indexing method. So
snapshotting a directory entry within the FS tree would require
(somehow) making an atomic copy, or CoW copy, of only the parts of the
FS tree that fall under the directory in question -- so you'd end up
trying to take a sequence of records in the FS tree, of arbitrary size
(proportional roughly to the number of entries in the directory) and
copying them to somewhere else in the same tree in such a way that you
can automatically dereference the copies when you modify them. So,
ultimately, it boils down to being able to do CoW operations at the
byte level, which is going to introduce huge quantities of extra
metadata, and it all starts looking really awkward to implement (plus
having to deal with the long time taken to copy the directory entries
for the thing you're snapshotting).
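
To make the "list of inodes, in inode order" point concrete, here is a
deliberately simplified sketch in Python. It is an assumption-laden toy: the
real FS tree holds several item types per inode, whereas this model keeps one
record per inode (name and parent), and the inode numbers and names are
invented for illustration.

```python
# Flat "FS tree": inode number -> (name, parent inode), held in inode order.
fs_tree = {
    256: ("/", None),
    257: ("home", 256),
    258: ("etc", 256),
    259: ("alice", 257),
    260: ("fstab", 258),
    261: ("notes.txt", 259),
}


def inodes_under(tree, top):
    """Collect every inode reachable from directory `top`, in inode order."""
    found = {top}
    changed = True
    while changed:  # repeat until no new descendants turn up
        changed = False
        for ino, (_name, parent) in tree.items():
            if parent in found and ino not in found:
                found.add(ino)
                changed = True
    return sorted(found)


def is_contiguous(inos):
    """True if the inode numbers form one unbroken run in the key space."""
    return inos == list(range(inos[0], inos[-1] + 1))


home_inodes = inodes_under(fs_tree, 257)
```

In this toy tree the records under "home" are interleaved with those under
"etc", so a directory does not correspond to a single subtree (or even a
single key range) that could be shared by copying one block.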

   I doubt it would be possible to retrofit btrfs to do it without
more or less a ground-up rewrite, if even then. I would further doubt
that you'd end up with something that would run with any kind of
acceptable performance, or with sane bounds on the amount of metadata
used.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I am but mad north-north-west:  when the wind is southerly, I ---  
                       know a hawk from a handsaw.

Gabriel de Perthuis

2013-Jul-23 17:47 UTC

Re: Q: Why subvolumes?

>    Now... since the snapshot's FS tree is a direct duplicate of the
> original FS tree (actually, it's the same tree, but they look like
> different things to the outside world), they share everything --
> including things like inode numbers. This is OK within a subvolume,
> because we have the semantics that subvolumes have their own distinct
> inode-number spaces. If we could snapshot arbitrary subsections of the
> FS, we'd end up having to fix up inode numbers to ensure that they
> were unique -- which can't really be an atomic operation (unless you
> want to have the FS locked while the kernel updates the inodes of the
> billion files you just snapshotted).

I don't think so; I just checked some snapshots and the inos are the
same.
Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

>    The other thing to talk about here is that while the FS tree is a
> tree structure, it's not a direct one-to-one map to the directory tree
> structure. In fact, it looks more like a list of inodes, in inode
> order, with some extra info for easily tracking through the list. The
> B-tree structure of the FS tree is just a fast indexing method. So
> snapshotting a directory entry within the FS tree would require
> (somehow) making an atomic copy, or CoW copy, of only the parts of the
> FS tree that fall under the directory in question -- so you'd end up
> trying to take a sequence of records in the FS tree, of arbitrary size
> (proportional roughly to the number of entries in the directory) and
> copying them to somewhere else in the same tree in such a way that you
> can automatically dereference the copies when you modify them. So,
> ultimately, it boils down to being able to do CoW operations at the
> byte level, which is going to introduce huge quantities of extra
> metadata, and it all starts looking really awkward to implement (plus
> having to deal with the long time taken to copy the directory entries
> for the thing you're snapshotting).

Btrfs already does CoW of arbitrarily-large files (extent lists);
doing the same for directories doesn't seem impossible.

Hugo Mills

2013-Jul-23 19:30 UTC

Re: Q: Why subvolumes?

On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
> >    Now... since the snapshot's FS tree is a direct duplicate of the
> > original FS tree (actually, it's the same tree, but they look like
> > different things to the outside world), they share everything --
> > including things like inode numbers. This is OK within a subvolume,
> > because we have the semantics that subvolumes have their own distinct
> > inode-number spaces. If we could snapshot arbitrary subsections of the
> > FS, we'd end up having to fix up inode numbers to ensure that they
> > were unique -- which can't really be an atomic operation (unless you
> > want to have the FS locked while the kernel updates the inodes of the
> > billion files you just snapshotted).
>
> I don't think so; I just checked some snapshots and the inos are the same.
> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

   That's what I said. Our current implementation allows different
subvolumes to have the same inode numbers, which is what makes it
work. If you threw out the concept of subvolumes, or allowed snapshots
within subvolumes, then you'd be duplicating inodes within a
subvolume, which is one reason it doesn't work.
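
The check discussed above can be repeated from user space. The following is a
small sketch in Python using the standard os.stat call; the helper names
file_identity and compare are invented here. Going by the discussion in this
thread, a file and its copy inside a snapshot would be expected to report the
same st_ino but a different st_dev, though that is something to verify on your
own system rather than take from this sketch.

```python
import os


def file_identity(path):
    """Return the (device, inode) pair that identifies a file to userspace."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def compare(path_a, path_b):
    """Report whether two paths share an inode number and/or a device number."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return {
        "same_ino": a.st_ino == b.st_ino,
        "same_dev": a.st_dev == b.st_dev,
    }
```

For example, compare("/mnt/vol/some-file", "/mnt/vol-snap/some-file") could be
used, where both paths are placeholders for a file and its snapshotted twin.
Tools that detect hard links or avoid visiting a file twice typically rely on
the (st_dev, st_ino) pair being unique, which is why duplicated inode numbers
under a single device number would be a problem.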

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- Unix: For controlling fungal diseases in crops. ---

Gabriel de Perthuis

2013-Jul-23 19:41 UTC

Re: Q: Why subvolumes?

On Tue, 23 Jul 2013 21:30:13 CEST, Hugo Mills wrote:
> On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
>>>    Now... since the snapshot's FS tree is a direct duplicate of the
>>> original FS tree (actually, it's the same tree, but they look like
>>> different things to the outside world), they share everything --
>>> including things like inode numbers. This is OK within a subvolume,
>>> because we have the semantics that subvolumes have their own distinct
>>> inode-number spaces. If we could snapshot arbitrary subsections of the
>>> FS, we'd end up having to fix up inode numbers to ensure that they
>>> were unique -- which can't really be an atomic operation (unless you
>>> want to have the FS locked while the kernel updates the inodes of the
>>> billion files you just snapshotted).
>>
>> I don't think so; I just checked some snapshots and the inos are the same.
>> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).
>
>    That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.

Sorry for misreading you.
Directory snapshots can work by giving a new device number to the snapshot.
There is no need to update inode numbers in that case.

Jerome Haltom

2013-Jul-23 19:43 UTC

Re: Q: Why subvolumes?

Why not just create the new dev_id on the destination snapshot of any
directory? That way the snapshot can share inodes with its source.

On Tue, Jul 23, 2013 at 2:30 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
>> >    Now... since the snapshot's FS tree is a direct duplicate of the
>> > original FS tree (actually, it's the same tree, but they look like
>> > different things to the outside world), they share everything --
>> > including things like inode numbers. This is OK within a subvolume,
>> > because we have the semantics that subvolumes have their own distinct
>> > inode-number spaces. If we could snapshot arbitrary subsections of the
>> > FS, we'd end up having to fix up inode numbers to ensure that they
>> > were unique -- which can't really be an atomic operation (unless you
>> > want to have the FS locked while the kernel updates the inodes of the
>> > billion files you just snapshotted).
>>
>> I don't think so; I just checked some snapshots and the inos are the same.
>> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).
>
>    That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.
>
>    Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
>          --- Unix: For controlling fungal diseases in crops. ---

Chris Murphy

2013-Jul-23 21:52 UTC

Re: Q: Why subvolumes?

On Jul 23, 2013, at 1:43 PM, Jerome Haltom <wasabi@cogito.cx> wrote:

> Why not just create the new dev_id on the destination snapshot of any
> directory?

Right now, snapshots of subvolumes do not contain the contents of contained
subvolumes. Hmmm, that sounds horrid.

Subvolume A
	File 1
	File 2
	Subvolume B
		File 3
		File 4

If I snapshot subvolume A, the resulting snapshot does not contain File 3 and
File 4. Subvolume B is a regular folder in the snapshot of Subvolume A.
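
The example above can be mimicked with a tiny Python model. It is purely
illustrative: the Subvolume class and both functions are invented names, and
representing the nested subvolume inside the snapshot as an empty placeholder
directory is an assumption made for this sketch.

```python
class Subvolume:
    """A toy subvolume: some files plus any nested subvolumes."""

    def __init__(self, name, files=(), children=()):
        self.name = name
        self.files = list(files)
        self.children = list(children)  # nested subvolumes


def snapshot_subvolume(subvol, name):
    """Snapshot one subvolume without descending into nested subvolumes.

    Each nested subvolume is a separate tree, so the snapshot only gets an
    empty placeholder carrying the same name.
    """
    placeholders = [Subvolume(child.name) for child in subvol.children]
    return Subvolume(name, subvol.files, placeholders)


def list_files(subvol):
    """List all file names visible under a subvolume, nested ones included."""
    names = list(subvol.files)
    for child in subvol.children:
        names.extend(list_files(child))
    return names


subvol_b = Subvolume("Subvolume B", ["File 3", "File 4"])
subvol_a = Subvolume("Subvolume A", ["File 1", "File 2"], [subvol_b])
snap_a = snapshot_subvolume(subvol_a, "Snapshot of A")
```

Listing subvol_a shows all four files, while listing snap_a shows only File 1
and File 2, with "Subvolume B" present only as an empty entry.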

So if every directory were a subvolume by default, this limitation would need to
be resolved or snapshotting would become useless. I'm sure
there's a more coherent explanation why this isn't desired.

> That way the snapshot can share inodes with its source.

Snapshots already share inode numbers.


Chris Murphy

Jerome Haltom

2013-Jul-23 23:39 UTC

Re: Q: Why subvolumes?

Yeah. I was merely curious about the architecture limits that drove
the design this way, to begin with. Mostly because it seems "odd". It
seems like the most obvious and most natural thing from the user's
perspective would be to just be able to reflink directories. Like
every decent source control system that exists, for instance. So, I
figured there must be some very good reason it wasn't done like that.

I'm still not completely sure what that very good reason is. Obviously
whatever structure that currently exists for subvolumes would need to
continue existing, to begin a unique inode scope. But, since
apparently the VFS can be instructed to plop a new dev_id anywhere in
the hierarchy, I still don't see why explicit subvolumes are
required. Seems more natural to be able to put a quota on a directory.
To be able to set raid policy on a directory. Compression on a
directory. COW semantics on a directory. Etc.

Ahh well, some of you gave really nice detailed answers, and I
appreciate that. Thanks.

On Tue, Jul 23, 2013 at 4:52 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Jul 23, 2013, at 1:43 PM, Jerome Haltom <wasabi@cogito.cx> wrote:
>
>> Why not just create the new dev_id on the destination snapshot of any
>> directory?
>
> Right now, snapshots of subvolumes do not contain the contents of contained
> subvolumes. Hmmm, that sounds horrid.
>
> Subvolume A
>         File 1
>         File 2
>         Subvolume B
>                 File 3
>                 File 4
>
> If I snapshot subvolume A, the resulting snapshot does not contain File 3
> and File 4. Subvolume B is a regular folder in the snapshot of Subvolume A.
>
> So if every directory were a subvolume by default, this limitation would
> need to be resolved or snapshotting would become useless. I'm sure
> there's a more coherent explanation why this isn't desired.
>
>> That way the snapshot can share inodes with its source.
>
>
> Snapshots already share inode numbers.
>
>
> Chris Murphy

Josef Bacik

2013-Jul-24 01:27 UTC

Re: Q: Why subvolumes?

On Tue, Jul 23, 2013 at 06:39:57PM -0500, Jerome Haltom wrote:
> Yeah. I was merely curious about the architecture limits that drove
> the design this way, to begin with. Mostly because it seems "odd". It
> seems like the most obvious and most natural thing from the user's
> perspective to do would just be able to reflink directories. Like
> every decent source control system that exists, for instance. So, I
> figured there must be some very good reason it wasn't done like that.
>
> I'm still not completely sure what that very good reason is. Obviously
> whatever structure that currently exists for subvolumes would need to
> continue existing, to begin a unique inode scope. But, since
> apparently the VFS can be instructed to plop a new dev_id anywhere in
> the hierarchy, I still don't see why explicit subvolumes are
> required. Seems more natural to be able to put a quota on a directory.
> To be able to set raid policy on a directory. Compression on a
> directory. COW semantics on a directory. Etc.
>
> Ahh well, some of you gave really nice detailed answers, and I
> appreciate that. Thanks.
>

Subvolumes are described as directories simply to make it easier to understand.
Directories do not change the hierarchy within the file system itself; they are
simply items in the btree like anything else, and are not special at all.
Subvolumes are _represented_ as directories, but really the directories are just
links to subvolumes.  Each subvolume is a completely separate b-tree, with its
own locking, its own inode numbering and everything.  And this isn't inode
numbering for the sake of inode numbering: our inode numbers are picked by
simply being the next largest objectid we can add to our tree.  Since a
subvolume is its own tree, its inode numbers start over at the beginning.
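
The numbering rule described above can be sketched in a few lines of Python.
This is a toy, not btrfs code: SubvolumeTree and its methods are made-up
names, and the starting value FIRST_OBJECTID is an assumption chosen only for
illustration.

```python
FIRST_OBJECTID = 256  # assumed starting point, illustrative only


class SubvolumeTree:
    """A toy subvolume: its own item table with its own objectid sequence."""

    def __init__(self, tree_id):
        self.tree_id = tree_id
        self.items = {}  # objectid -> name

    def create(self, name):
        """Add an item using the next largest objectid in *this* tree."""
        objectid = max(self.items, default=FIRST_OBJECTID - 1) + 1
        self.items[objectid] = name
        return objectid

    def identity(self, objectid):
        """An object is only unique as a (tree, objectid) pair."""
        return (self.tree_id, objectid)


tree_one = SubvolumeTree(tree_id=1)
tree_two = SubvolumeTree(tree_id=2)
first_in_one = tree_one.create("a.txt")
second_in_one = tree_one.create("b.txt")
first_in_two = tree_two.create("c.txt")
```

Here first_in_one and first_in_two come out as the same number, because each
tree counts independently; only the pair returned by identity() tells the two
objects apart.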

So it's not that we can just fork off a directory and snapshot there, because
a directory is not a tree, it's just an item.  A subvolume is its own tree,
which can be snapshotted and locked independently from the other subvolumes.
Thanks,

Josef 

Chris Murphy

2013-Jul-24 02:02 UTC

Re: Q: Why subvolumes?

On Jul 23, 2013, at 7:27 PM, Josef Bacik <jbacik@fusionio.com> wrote:
>
> Subvolumes are described as directories simply to make it easier to understand.
> Directories do not change the hierarchy within the file system itself; they are
> simply items in the btree like anything else, and are not special at all.
> Subvolumes are _represented_ as directories, but really the directories are just
> links to subvolumes.  Each subvolume is a completely separate b-tree, with its
> own locking, its own inode numbering and everything.  And this isn't inode
> numbering for the sake of inode numbering: our inode numbers are picked by
> simply being the next largest objectid we can add to our tree.  Since a
> subvolume is its own tree, its inode numbers start over at the beginning.
>
> So it's not that we can just fork off a directory and snapshot there, because
> a directory is not a tree, it's just an item.  A subvolume is its own tree,
> which can be snapshotted and locked independently from the other subvolumes.
> Thanks,

I like this, it's useful. Could it be integrated into the Wiki?


Chris Murphy

Alexandre Oliva

2013-Aug-04 14:56 UTC

Re: Q: Why subvolumes?

On Jul 23, 2013, Jerome Haltom <wasabi@cogito.cx> wrote:

> Why not just create the new dev_id on the destination snapshot of any
> directory? That way the snapshot can share inodes with its source.

Agreed.  Nothing stops us from implementing snapshotting of any
directory whatsoever: all it takes is to take a snapshot of the
subvolume enclosing the directory we want to snapshot, remove
everything that's not in the requested directory from the snapshot, and
make that directory the root of the snapshot.  The only tricky bit
here AFAICT is to arrange for the non-snapshotted subtree components to
be cleaned up in background.  If we had some primitive to unlink an
entire subtree and clean it up in background we could use that.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
