May I ask why the decision to implement snapshotting through subvolumes? I've been very curious about why the design wasn't to simply allow snapshotting of any directory or file.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jerome,

essentially, a btrfs subvolume is the root of a btrfs filesystem (you can take it and mount it as it is). This is critical for the snapshot functionality: if you have a subvolume (consisting of a snapshot) for, say, "/", and your system goes south (e.g. after updating the kernel or another crucial system package), then all you have to do is tell the Linux kernel at the bootloader prompt (via the rootflags=... parameter) not to mount the default btrfs subvolume, but the snapshot. Then you can boot the "last known good" state of the system normally and recover from the comfort of a running system.

The main point here is that the default btrfs "subvolume" (which you would normally mount as /) is technically not different from any other subvolume at all.

Best regards,
Andreas

-----Original Message-----
From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On behalf of Jerome Haltom
Sent: Tuesday, 23 July 2013 14:00
To: Linux Btrfs
Subject: Q: Why subvolumes?
On Tue, Jul 23, 2013 at 06:59:35AM -0500, Jerome Haltom wrote:
> May I ask why the decision to implement snapshotting through
> subvolumes? I've been very curious about why the design wasn't to
> simply allow snapshotting of any directory or file.

tl;dr: It just doesn't work that way, and it's hard to make it work within the bounds of snapshots being atomic.

It's down to the way that snapshots are implemented (btrfs being a copy-on-write filesystem). A snapshot is an (atomic) copy of the FS tree for a subvolume, where the FS tree is the metadata tree which holds the inode information, filenames, directory structure, permissions and so forth. Being a CoW FS, we can do this easily by copying only the root block of the tree -- a matter of a few KiB. Running ls -R on a snapshot and its original will read exactly the same blocks on the disk, except for the single top-level block in each case. As the snapshot is modified, the metadata changes, and parts of the FS tree for the snapshot are CoWed, leaving the original blocks in place. There is a reference-counting mechanism here as well, to ensure that we don't leave unused blocks lying around the place.

Now... since the snapshot's FS tree is a direct duplicate of the original FS tree (actually, it's the same tree, but they look like different things to the outside world), they share everything -- including things like inode numbers. This is OK across subvolumes, because we have the semantics that subvolumes have their own distinct inode-number spaces. If we could snapshot arbitrary subsections of the FS, we'd end up having to fix up inode numbers to ensure that they were unique -- which can't really be an atomic operation (unless you want to have the FS locked while the kernel updates the inodes of the billion files you just snapshotted).

The other thing to talk about here is that while the FS tree is a tree structure, it's not a direct one-to-one map to the directory tree structure. In fact, it looks more like a list of inodes, in inode order, with some extra info for easily tracking through the list. The B-tree structure of the FS tree is just a fast indexing method. So snapshotting a directory entry within the FS tree would require (somehow) making an atomic copy, or CoW copy, of only the parts of the FS tree that fall under the directory in question -- so you'd end up trying to take a sequence of records in the FS tree, of arbitrary size (roughly proportional to the number of entries in the directory), and copying them to somewhere else in the same tree in such a way that you can automatically dereference the copies when you modify them.

So, ultimately, it boils down to being able to do CoW operations at the byte level, which is going to introduce huge quantities of extra metadata, and it all starts looking really awkward to implement (plus having to deal with the long time taken to copy the directory entries for the thing you're snapshotting).

I doubt it would be possible to retrofit btrfs to do it without more or less a ground-up rewrite, if even then. I would further doubt that you'd end up with something that would run with any kind of acceptable performance, or with sane bounds on the amount of metadata used.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I am but mad north-north-west: when the wind is southerly, I ---
know a hawk from a handsaw.
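The root-block copy described in the message above can be illustrated with a toy tree. This is only a sketch under simplifying assumptions: the names Node, snapshot and cow_set are invented for the example, blocks are plain Python objects rather than on-disk B-tree nodes, and the reference counting mentioned above is left to Python's garbage collector.

```python
class Node:
    """One block of a toy tree: named children plus an optional value."""

    def __init__(self, children=None, value=None):
        # Shallow copy of the mapping: the child blocks themselves are shared.
        self.children = dict(children or {})
        self.value = value


def snapshot(root):
    """Snapshot by copying only the root block; all children stay shared."""
    return Node(root.children, root.value)


def cow_set(root, path, value):
    """Return a new root in which `path` holds `value`.

    Only the blocks along `path` are copied; every other block is shared
    with the old root, which is left untouched.
    """
    new_root = Node(root.children, root.value)
    if not path:
        new_root.value = value
        return new_root
    head, rest = path[0], path[1:]
    child = root.children.get(head, Node())
    new_root.children[head] = cow_set(child, rest, value)
    return new_root
```

In this model a snapshot costs one small object regardless of tree size, and a later write to the snapshot copies only the blocks on the path to the changed entry. Note that it snapshots a whole tree from its root; it says nothing about cutting out an arbitrary subdirectory.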
> Now... since the snapshot's FS tree is a direct duplicate of the
> original FS tree (actually, it's the same tree, but they look like
> different things to the outside world), they share everything --
> including things like inode numbers. This is OK within a subvolume,
> because we have the semantics that subvolumes have their own distinct
> inode-number spaces. If we could snapshot arbitrary subsections of the
> FS, we'd end up having to fix up inode numbers to ensure that they
> were unique -- which can't really be an atomic operation (unless you
> want to have the FS locked while the kernel updates the inodes of the
> billion files you just snapshotted).

I don't think so; I just checked some snapshots and the inos are the same. Btrfs just changes the dev_id of subvolumes (somehow the VFS allows this).

> The other thing to talk about here is that while the FS tree is a
> tree structure, it's not a direct one-to-one map to the directory tree
> structure. In fact, it looks more like a list of inodes, in inode
> order, with some extra info for easily tracking through the list. The
> B-tree structure of the FS tree is just a fast indexing method. So
> snapshotting a directory entry within the FS tree would require
> (somehow) making an atomic copy, or CoW copy, of only the parts of the
> FS tree that fall under the directory in question -- so you'd end up
> trying to take a sequence of records in the FS tree, of arbitrary size
> (proportional roughly to the number of entries in the directory) and
> copying them to somewhere else in the same tree in such a way that you
> can automatically dereference the copies when you modify them. So,
> ultimately, it boils down to being able to do CoW operations at the
> byte level, which is going to introduce huge quantities of extra
> metadata, and it all starts looking really awkward to implement (plus
> having to deal with the long time taken to copy the directory entries
> for the thing you're snapshotting).

Btrfs already does CoW of arbitrarily large files (extent lists); doing the same for directories doesn't seem impossible.
On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
> I don't think so; I just checked some snapshots and the inos are the same.
> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

That's what I said. Our current implementation allows different subvolumes to have the same inode numbers, which is what makes it work. If you threw out the concept of subvolumes, or allowed snapshots within subvolumes, then you'd be duplicating inodes within a subvolume, which is one reason it doesn't work.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Unix: For controlling fungal diseases in crops. ---
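The observation being discussed (same inode numbers, different device IDs) can be checked from user space. The following is a minimal sketch that assumes only the standard os.lstat interface; the helper names file_identity and group_by_identity are invented for illustration. On a btrfs system, a file and its counterpart in a snapshot would be expected to report the same st_ino but a different st_dev, as described above.

```python
import os
from collections import defaultdict


def file_identity(path):
    """Return the (st_dev, st_ino) pair commonly used to identify a file."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def group_by_identity(paths):
    """Group paths that share a (device, inode) pair.

    Tools that detect hard links typically treat such paths as one file.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[file_identity(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Usage: pass a file and its snapshot copy as arguments and compare.
    for (dev, ino), names in group_by_identity(sys.argv[1:]).items():
        print(f"dev={dev} ino={ino}: {names}")
```

If two distinct files inside one subvolume ended up with the same pair, a tool grouping files this way would conflate them, which is the duplication problem raised in the message above.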
On Tue, 23 Jul 2013 21:30:13 CEST, Hugo Mills wrote:
> That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.

Sorry for misreading you. Directory snapshots can work by giving a new device number to the snapshot. There is no need to update inode numbers in that case.
Why not just create the new dev_id on the destination snapshot of any directory? That way the snapshot can share inodes with its source.

On Tue, Jul 23, 2013 at 2:30 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.
On Jul 23, 2013, at 1:43 PM, Jerome Haltom <wasabi@cogito.cx> wrote:
> Why not just create the new dev_id on the destination snapshot of any
> directory?

Right now, snapshots of subvolumes do not contain the contents of contained subvolumes. Hmmm, that sounds horrid.

Subvolume A
    File 1
    File 2
    Subvolume B
        File 3
        File 4

If I snapshot subvolume A, the resulting snapshot does not contain File 3 and File 4. Subvolume B appears as a regular (empty) folder in the snapshot of Subvolume A.

So if every directory were a subvolume by default, this limitation would need to be resolved or snapshotting would become useless. I'm sure there's a more coherent explanation why this isn't desired.

> That way the snapshot can share inodes with its source.

Snapshots already share inode numbers.

Chris Murphy
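The nested-subvolume behaviour in the example above can be modelled in a few lines. This is a toy sketch, not btrfs code: the Subvolume class and snapshot function are invented for illustration, and the snapshot copies entries eagerly instead of sharing blocks copy-on-write.

```python
class Subvolume:
    """Toy subvolume: a name plus its own tree of directory entries.

    An entry is a str (file contents), a dict (plain directory), or
    another Subvolume (a link to a separate tree).
    """

    def __init__(self, name, entries=None):
        self.name = name
        self.entries = entries if entries is not None else {}


def _copy_tree(entries):
    """Copy one subvolume's entries without following nested subvolumes."""
    copied = {}
    for name, entry in entries.items():
        if isinstance(entry, Subvolume):
            copied[name] = {}  # nested subvolume shows up as an empty folder
        elif isinstance(entry, dict):
            copied[name] = _copy_tree(entry)
        else:
            copied[name] = entry
    return copied


def snapshot(subvol, new_name):
    """Snapshot a single subvolume; contained subvolumes are not included."""
    return Subvolume(new_name, _copy_tree(subvol.entries))
```

Building Subvolume A with a nested Subvolume B and calling snapshot on A gives a tree with File 1 and File 2 and an empty "Subvolume B" folder, matching the description in the message.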
Yeah. I was merely curious about the architecture limits that drove the design this way, to begin with. Mostly because it seems "odd". It seems like the most obvious and most natural thing to do from the user's perspective would be to just reflink directories. Like every decent source control system that exists, for instance. So, I figured there must be some very good reason it wasn't done like that.

I'm still not completely sure what that very good reason is. Obviously whatever structure currently exists for subvolumes would need to continue existing, to begin a unique inode scope. But, since apparently the VFS can be instructed to plop a new dev_id anywhere in the hierarchy, I still don't see why explicit subvolumes are required. Seems more natural to be able to put a quota on a directory. To be able to set raid policy on a directory. Compression on a directory. COW semantics on a directory. Etc.

Ahh well, some of you gave really nice detailed answers, and I appreciate that. Thanks.

On Tue, Jul 23, 2013 at 4:52 PM, Chris Murphy <lists@colorremedies.com> wrote:
> Right now, snapshots of subvolumes do not contain the contents of
> contained subvolumes. Hmmm, that sounds horrid.
>
> Subvolume A
>     File 1
>     File 2
>     Subvolume B
>         File 3
>         File 4
>
> If I snapshot subvolume A, the resulting snapshot does not contain
> File 3 and File 4. Subvolume B is a regular folder in the snapshot of
> Subvolume A.
>
> So if every directory were a subvolume by default, this limitation
> would need to be resolved or snapshotting would become useless. I'm
> sure there's a more coherent explanation why this isn't desired.
>
> Snapshots already share inode numbers.
On Tue, Jul 23, 2013 at 06:39:57PM -0500, Jerome Haltom wrote:
> Yeah. I was merely curious about the architecture limits that drove
> the design this way, to begin with. Mostly because it seems "odd". It
> seems like the most obvious and most natural thing from the user's
> perspective to do would just be able to reflink directories. Like
> every decent source control system that exists, for instance. So, I
> figured there must be some very good reason it wasn't done like that.
>
> I'm still not completely sure what that very good reason is. Obviously
> whatever structure that currently exists for subvolumes would need to
> continue existing, to begin a unique inode scope. But, since
> apparently the VFS can be instructed to plop a new dev_id anywhere in
> the hierarchy, I still don't see why explicit subvolumes are
> required. Seems more natural to be able to put a quota on a directory.
> To be able to set raid policy on a directory. Compression on a
> directory. COW semantics on a directory. Etc.
>
> Ahh well, some of you gave really nice detailed answers, and I
> appreciate that. Thanks.

Subvolumes are described as directories simply to make them easier to understand. Directories do not change the hierarchy within the file system itself; they are simply items in the b-tree like anything else, and they are not special at all. Subvolumes are _represented_ as directories, but really the directories are just links to subvolumes. Each subvolume is a completely separate b-tree: it has its own locking, its own inode numbering and everything. And this isn't inode numbering for the sake of inode numbering; our inode numbers are picked by simply taking the next largest objectid we can add to our tree. Since a subvolume is its own tree, its inode numbers start over at the beginning.

So it's not that we can just fork off a directory and snapshot there, because a directory is not a tree, it's just an item. A subvolume is its own tree, which can be snapshotted and locked independently of the other subvolumes. Thanks,

Josef
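The numbering rule described above (each subvolume is its own tree, and a new inode number is the next objectid above the largest one in that tree) can be sketched as a toy model. The class name SubvolumeTree and the starting value first_objectid=1 are assumptions made for the example and are not taken from the btrfs source.

```python
class SubvolumeTree:
    """Toy model of one subvolume's tree: objectid -> item name."""

    def __init__(self, first_objectid=1):
        # Illustrative starting value only; not btrfs's real constant.
        self.first_objectid = first_objectid
        self.items = {}

    def create(self, name):
        """Add an item under the next objectid above the largest in this tree."""
        objectid = max(self.items, default=self.first_objectid - 1) + 1
        self.items[objectid] = name
        return objectid

    def snapshot(self):
        """Return a separate tree with the same items and the same numbering."""
        snap = SubvolumeTree(self.first_objectid)
        snap.items = dict(self.items)
        return snap
```

Two independent trees both start numbering from the same value, and a snapshot continues from the same point as its source, so the same number can name different files in different trees. That is harmless only because each tree is a separate numbering space.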
On Jul 23, 2013, at 7:27 PM, Josef Bacik <jbacik@fusionio.com> wrote:
> Subvolumes are described as directories simply to make it easier to understand.
> Directories do not change the hierarchy within the file system itself, they are
> simply items in the btree like anything else, they are not special at all.
> Subvolumes are _represented_ as directories, but really the directories are just
> links to subvolumes. Subvolumes are a completely separate b-tree, it has its
> own locking, its own inode numbering and everything. And this isn't inode
> numbering for the sake of inode numbering, our inode numbers are picked by
> simply being the next largest objectid we can add to our tree. Since a
> subvolume is its own tree its inode numbers start over at the beginning.
>
> So it's not that we can just fork off a directory and snapshot there, because
> it's not a tree, it's just an item. A subvolume is its own tree, which can be
> snapshotted and locked independently from the other subvolumes. Thanks,

I like this, it's useful. Could it be integrated into the Wiki?

Chris Murphy
On Jul 23, 2013, Jerome Haltom <wasabi@cogito.cx> wrote:
> Why not just create the new dev_id on the destination snapshot of any
> directory? That way the snapshot can share inodes with its source.

Agreed. Nothing stops us from implementing snapshotting of any directory whatsoever: all it takes is taking a snapshot of the subvolume enclosing the directory we want to snapshot, removing everything that's not in the requested directory from the snapshot, and making that directory the root of the snapshot.

The only tricky bit here AFAICT is to arrange for the non-snapshotted subtree components to be cleaned up in the background. If we had some primitive to unlink an entire subtree and clean it up in the background, we could use that.

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
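The three steps proposed above (snapshot the enclosing subvolume, discard everything outside the requested directory, make that directory the new root) can be sketched on a toy directory tree. Everything here is hypothetical: directories are nested dicts, the names count_entries and snapshot_directory are invented, and the returned count merely stands in for the background cleanup work the message mentions.

```python
def count_entries(tree):
    """Count the files and directories below a nested-dict directory."""
    total = 0
    for child in tree.values():
        total += 1
        if isinstance(child, dict):
            total += count_entries(child)
    return total


def snapshot_directory(subvol_root, path):
    """Return (new_root, entries_to_clean) for a snapshot of one directory.

    subvol_root is the enclosing subvolume as nested dicts; path is a list
    of directory names leading from that root to the requested directory.
    """
    snap_root = dict(subvol_root)  # step 1: root-only copy of the subvolume
    target = snap_root
    for part in path:
        target = target[part]
        if not isinstance(target, dict):
            raise NotADirectoryError("/".join(path))
    # Step 2: everything in the snapshot outside the target must be removed.
    entries_to_clean = count_entries(snap_root) - count_entries(target)
    # Step 3: the requested directory becomes the root of the snapshot.
    return dict(target), entries_to_clean
```

The sketch makes the cost visible: the snapshot itself is cheap, but the amount of cleanup grows with the size of the rest of the subvolume, not with the size of the directory that was requested.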