Hello,

Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.

=== What are subvolumes? ==

They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.

=== What do subvolumes look like? ==

All the user sees are directories. They act like any other directory acts, with a few exceptions

1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.

1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.

2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.

But permissions and everything else are the same.

There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.

This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.

=== How do we want subvolumes to work from a user perspective? ==

1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.

2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.

3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

=== Quotas ==

This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here

1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.

2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.

=== What do we do? ==

This is where I expect to see the most discussion. Here is what I want to do

1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic that way. This unfortunately will be an incompatible format change, but the sooner we get this addressed the easier it will be in the long run. Obviously when I say format change I mean via the incompat bits we have, so old fs's won't be broken and such.

2) Do something like NFS's referral mounts when we cd into a subvolume. Now we just do dentry trickery, but that doesn't make the boundary between subvolumes clear, so it will confuse people (and samba) when they walk into a subvolume and all of a sudden the inode numbers are the same as in the directory behind them. By doing the referral mount thing, each subvolume appears to be its own mount and that way things like NFS and samba will work properly.

I feel like I'm forgetting something here; hopefully somebody will point it out.

=== Conclusion ==

There are definitely some wonky things with subvolumes, but I don't think they are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road, when BTRFS starts getting used in production, we really find out how many things our current scheme breaks and then have to do the changes then.

Thanks,

Josef
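[To make the duplicate-inode problem above concrete, here is a minimal sketch of what a userspace tool sees today. The mount point /mnt/btrfs and subvolume name sub-a are made up for the example; the behaviour it checks for (unchanged st_dev, shared inode number 256) is the one described in the email.]

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat top, sub;

	/* Hypothetical layout: /mnt/btrfs is the filesystem, sub-a a subvolume in it. */
	if (stat("/mnt/btrfs", &top) != 0 || stat("/mnt/btrfs/sub-a", &sub) != 0) {
		perror("stat");
		return 1;
	}

	/* Crossing into the subvolume does not change st_dev today... */
	if (top.st_dev == sub.st_dev)
		printf("same st_dev: userspace sees one filesystem\n");

	/* ...so the shared inode number is the only hint of a subvolume boundary. */
	if (sub.st_ino == 256)
		printf("st_ino == 256: sub-a looks like a subvolume root\n");

	return 0;
}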
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

It would be helpful to be able to create subvolumes off existing directories, instead of creating a subvolume and having to copy all the data around.

Mike
On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
>
> === How do we want subvolumes to work from a user perspective? ==
>
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.

could it be possible to convert a directory into a volume? or at least base a snapshot off it?

C Anthony
Excerpts from Josef Bacik's message of 2010-12-01 09:21:36 -0500:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.

Thanks for writing this up.

> === What do we do? ==
>
> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic that way. This unfortunately will be an incompatible format change, but the sooner we get this addressed the easier it will be in the long run. Obviously when I say format change I mean via the incompat bits we have, so old fs's won't be broken and such.

If they don't have inode number 256, what inode number do they have?

I'm assuming you mean the subvolume is given an inode number in the parent directory just like any other dir, but this doesn't get rid of the duplicate inode problem. I think it ends up making it less clear, but I'm open to suggestions ;)

We could give each subvol a different devt, which is something Christoph had asked about as well.

-chris
Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> >
> > === How do we want subvolumes to work from a user perspective? ==
> >
> > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> could it be possible to convert a directory into a volume? or at least base a snapshot off it?

I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

-chris
On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ==
> > >
> > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

ok; if i create an empty volume, and use cp --reflink, it would have the desired effect though, right?

C Anthony
Excerpts from C Anthony Risinger's message of 2010-12-01 11:03:23 -0500:
> On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <chris.mason@oracle.com> wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > > >
> > > > === How do we want subvolumes to work from a user perspective? ==
> > > >
> > > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > > >
> > > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> > >
> > > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.
>
> ok; if i create an empty volume, and use cp --reflink, it would have the desired effect though, right?

Almost, for no good reason at all our cp --reflink doesn't reflink across subvols. I'll get that fixed up.

-chris
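[For reference, the reflink copy being discussed comes down to the btrfs clone ioctl; a rough per-file sketch of what cp --reflink does is below. The ioctl number is defined locally to keep the example self-contained and is intended to mirror the btrfs ioctl header; the file names are made up and error handling is minimal, so treat this as an illustration rather than the actual cp implementation.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors the definition in the btrfs ioctl header. */
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE _IOW(BTRFS_IOCTL_MAGIC, 9, int)

int main(void)
{
	/* Hypothetical paths on the same btrfs filesystem. */
	int src = open("data/file", O_RDONLY);
	int dst = open("new-subvol/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Share the source's extents with the destination instead of copying data
	 * (this is the operation that, at the time of this thread, was refused
	 * when source and destination were in different subvolumes). */
	if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
		perror("BTRFS_IOC_CLONE");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}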
On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ==
> > >
> > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

Except you wouldn't have to copy data, only metadata.

Mike
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> === Quotas ==
>
> This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
>
> 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
>
> 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.

This is going to be tricky to get the semantics right, I suspect.

Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).

If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?

This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

Hugo.
Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

I would argue that a simple and probably correct solution is to have the files count toward the quota of everyone who has a COW copy. i.e. if I have a volume A and you make a snapshot B, the du content of B should count toward your quota as well, rather than being "free". I don't see any reason why this would not be the correct and intuitive way to do it.

Simply treat it as you would transparent block-level deduplication.

Gordan
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

In your case, it would sound fair that everyone is "simply" charged 10G.

What Josef is referring to would probably only apply to volumes and snapshots owned by the same user: If I have a subvolume of 10G, and a snapshot of it where I only changed 1G, the charged quota would be 11G, not 20G.

Mike
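[A quick worked version of that arithmetic, under the same-owner assumption Mike describes; the 10G/1G numbers and the two accounting policies are the ones from this thread, nothing more.]

#include <stdio.h>

int main(void)
{
	unsigned long long gib = 1024ULL * 1024 * 1024;
	unsigned long long original = 10 * gib;  /* blocks in subvolume A                */
	unsigned long long changed  = 1 * gib;   /* blocks rewritten in the snapshot B   */

	/* "Charge the diff": shared blocks counted once, plus what the snapshot rewrote. */
	unsigned long long charge_diff = original + changed;

	/* Naive accounting: count every block each subvolume references ("quota * 2"). */
	unsigned long long charge_naive = 2 * original;

	printf("diff accounting:  %llu GiB\n", charge_diff / gib);   /* 11 GiB */
	printf("naive accounting: %llu GiB\n", charge_naive / gib);  /* 20 GiB */
	return 0;
}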
On Wed, Dec 1, 2010 at 10:38 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

i'd expect that as a separate user, you should both be whacked 10G. imo, the whole benefit of transparent COW is to the administrator's advantage, thus i would even think the _uncompressed_ volume size would go against quota (which could possibly be artificially inflated to account for the space saving of compression). users just need a nice steadily predictable number to monitor.

though maybe these users could be grouped, such that the COW'ed portions of the files they share are balanced across each user's quota, but this would have to be a sort of "opt in" thing, else you get wild fluctuations because of other users' actions.

additionally, some users could be marked as "system", where COW'ing their subvol results in 0 quota -- you only pay for what you change -- but if the system subvol gets removed, then you pay for it all. in this way you would have to keep reusing system subvols to get any advantage as a regular user.

i don't know the existing systems though, so i don't know what it would take to do such balancing.

C Anthony
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?

If you delete your subvolume A, like use the btrfs tool to delete it, you will only be stuck with what you changed in snapshot B. So if you only changed 5gig worth of information, and you deleted the original subvolume, you would have 5gig charged to your quota. The idea is you are only charged for what blocks you have on the disk.

Thanks,

Josef
On Wednesday, 01 December, 2010, Josef Bacik wrote:
> Hello,

Hi Josef

> === What are subvolumes? ==
>
> They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
>
> === What do subvolumes look like? ==

[...]

> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.

Sorry, but I can't understand this sentence. It is clear that a directory and a subvolume have a totally different on-disk format. But why would it not be possible to remove a subvolume via the normal rmdir(2) syscall? I posted a patch some months ago: when rmdir is invoked on a subvolume, the same action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.

See https://patchwork.kernel.org/patch/260301/

[...]

> There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.
>
> This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.

How is this (or how should this be) different from a mounted filesystem? For example:

# cd /tmp
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# mkdir mount-a; mkdir mount-b
# mount /dev/sda6 mount-a    # an ext4 fs
# mount /dev/sdb2 mount-b    # an ext3 fs
# stat -c "%8i %n" sub-a sub-b mount-a mount-b
     256 sub-a
     256 sub-b
       2 mount-a
       2 mount-b

In this case the inode numbers returned are equal for both the mounted filesystems and the subvolumes. However, the fsid is different.

# stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
cdc937c1a203df74 sub-a
cdc937c1a203df77 sub-b
b27d147f003561c8 mount-a
d49e1a3d2333d2e1 mount-b
cdc937c1a203df75 .

Moreover I suggest looking at the difference between the inode returned by readdir(3) and the one returned by stat(3).

[...]

> I feel like I'm forgetting something here; hopefully somebody will point it out.

Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".

A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...

I am open to other opinions.

> === Conclusion ==
>
> There are definitely some wonky things with subvolumes, but I don't think they are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road, when BTRFS starts getting used in production, we really find out how many things our current scheme breaks and then have to do the changes then.
>
> Thanks,
>
> Josef

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
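[For context, the ioctl path that the rmdir patch above maps onto looks roughly like this from userspace. The ioctl number and argument layout are defined locally and are intended to mirror the btrfs ioctl header; the mount point and subvolume name are invented for the example.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_PATH_NAME_MAX 4087

/* Layout mirrors struct btrfs_ioctl_vol_args in the btrfs ioctl header. */
struct btrfs_ioctl_vol_args {
	long long fd;
	char name[BTRFS_PATH_NAME_MAX + 1];
};

#define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, struct btrfs_ioctl_vol_args)

int main(void)
{
	struct btrfs_ioctl_vol_args args;
	/* The ioctl is issued on the parent directory, naming the subvolume to remove. */
	int parent = open("/mnt/btrfs", O_RDONLY);

	if (parent < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	strncpy(args.name, "sub-a", sizeof(args.name) - 1);

	/* Roughly what "btrfs subvolume delete /mnt/btrfs/sub-a" does;
	 * the rmdir patch would trigger the same internal path from rmdir(2). */
	if (ioctl(parent, BTRFS_IOC_SNAP_DESTROY, &args) < 0) {
		perror("BTRFS_IOC_SNAP_DESTROY");
		return 1;
	}

	close(parent);
	return 0;
}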
On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
> On Wednesday, 01 December, 2010, Josef Bacik wrote:
> > Hello,
>
> Hi Josef
>
> > === What are subvolumes? ==
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
> >
> > === What do subvolumes look like? ==
>
> [...]
>
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
>
> Sorry, but I can't understand this sentence. It is clear that a directory and a subvolume have a totally different on-disk format. But why would it not be possible to remove a subvolume via the normal rmdir(2) syscall? I posted a patch some months ago: when rmdir is invoked on a subvolume, the same action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.
>
> See https://patchwork.kernel.org/patch/260301/

Oh hey, that's cool. That would be reasonable I think. I was just saying that currently we can't remove subvolumes/snapshots via rm, not that it wasn't possible at all. So I think what you did would be a good thing to have.

> [...]
>
> > There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.
> >
> > This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.
>
> How is this (or how should this be) different from a mounted filesystem? For example:
>
> # cd /tmp
> # btrfs subvolume create sub-a
> # btrfs subvolume create sub-b
> # mkdir mount-a; mkdir mount-b
> # mount /dev/sda6 mount-a    # an ext4 fs
> # mount /dev/sdb2 mount-b    # an ext3 fs
> # stat -c "%8i %n" sub-a sub-b mount-a mount-b
>      256 sub-a
>      256 sub-b
>        2 mount-a
>        2 mount-b
>
> In this case the inode numbers returned are equal for both the mounted filesystems and the subvolumes. However, the fsid is different.
>
> # stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
> cdc937c1a203df74 sub-a
> cdc937c1a203df77 sub-b
> b27d147f003561c8 mount-a
> d49e1a3d2333d2e1 mount-b
> cdc937c1a203df75 .
>
> Moreover I suggest looking at the difference between the inode returned by readdir(3) and the one returned by stat(3).

Yeah you are right, the inode numbering can probably be the same, we just need to make them logically different mounts so things like NFS and samba still work right.

> [...]
> > I feel like I'm forgetting something here; hopefully somebody will point it out.
>
> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>
> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>
> I am open to other opinions.

Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.

Thanks,

Josef
On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <josef@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>
>> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>>
>> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>>
>> I am open to other opinions.
>
> Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.

i would really like to see a solution to this particular issue. i may be missing something, but the dangling subvol roots doesn't seem to address the management of the root volume itself.

for example... most people will install their whole system into the real root (id=5), but this renders the system unmanageable, because there is no way to ever empty it without manually issuing an `rm -rf`.

i'm having a really hard time controlling this with the initramfs hook i provide for archlinux users. the hook requires a specific structure "underneath" what the user perceives as /, but i can only accomplish this for new installs -- for existing installs i can set up the proper "subroot" structure, and snapshot their current root... but i cannot remove the stagnant files in the real root (id=5) that will never, ever be accessed again.

... or does dangling roots address this?

C Anthony
On Wed, Dec 1, 2010 at 12:48 PM, C Anthony Risinger <anthony@extof.me> wrote:
> On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <josef@redhat.com> wrote:
>> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>>
>>> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>>>
>>> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>>>
>>> I am open to other opinions.
>>
>> Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.
>
> i would really like to see a solution to this particular issue. i may be missing something, but the dangling subvol roots doesn't seem to address the management of the root volume itself.
>
> for example... most people will install their whole system into the real root (id=5), but this renders the system unmanageable, because there is no way to ever empty it without manually issuing an `rm -rf`.
>
> i'm having a really hard time controlling this with the initramfs hook i provide for archlinux users. the hook requires a specific structure "underneath" what the user perceives as /, but i can only accomplish this for new installs -- for existing installs i can set up the proper "subroot" structure, and snapshot their current root... but i cannot remove the stagnant files in the real root (id=5) that will never, ever be accessed again.
>
> ... or does dangling roots address this?

i forgot to mention, but a quick 'n dirty solution would be to simply not enable users to do this by accident. mkfs.btrfs could create a new subvol, then mark it as default... this way the user has to manually mount with id=0, or remark 0 as the default.

effectively, users would unknowingly be installing into a subvolume, rather than the top-level root (apologies if my terminology is incorrect).

C Anthony
On Wednesday, 01 December, 2010, you (C Anthony Risinger) wrote:
[...]
> i forgot to mention, but a quick 'n dirty solution would be to simply not enable users to do this by accident. mkfs.btrfs could create a new subvol, then mark it as default... this way the user has to manually mount with id=0, or remark 0 as the default.
>
> effectively, users would unknowingly be installing into a subvolume, rather than the top-level root (apologies if my terminology is incorrect).

I fully agree: it fulfills the KISS principle :-)

> C Anthony

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > === Quotas ==
> > >
> > > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> > >
> > > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> > >
> > > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
> >
> > This is going to be tricky to get the semantics right, I suspect.
> >
> > Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
> >
> > If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> If you delete your subvolume A, like use the btrfs tool to delete it, you will only be stuck with what you changed in snapshot B. So if you only changed 5gig worth of information, and you deleted the original subvolume, you would have 5gig charged to your quota.

This doesn't work, though, if the owners of the "original" and "new" subvolume are different:

Case 1:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 2:
 * Porthos creates 10G of data.
 * Athos makes a snapshot of Porthos's data.
 * Porthos deletes his copy of the data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.

Case 3:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 4:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Athos's data.
 * Porthos deletes his copy of the data.

[Consider also Richelieu changing ownerships of Athos's and Aramis's data at alternative points in this sequence]

In each of these, who gets charged (and how much) for their copy of the data?

> The idea is you are only charged for what blocks you have on the disk.

My point was that it's perfectly possible to have blocks on the disk that are effectively owned by two people, and that the person to charge for those blocks is, to me, far from clear. You either end up charging twice for a single set of blocks on the disk, or you end up in a situation where one person's actions can cause another person's quota to fill up. Neither of these is particularly obvious behaviour.

Hugo.
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.
>
> === What are subvolumes? ==
>
> They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
>
> === What do subvolumes look like? ==
>
> All the user sees are directories. They act like any other directory acts, with a few exceptions
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes independently?

For that to work, we need to be able to take an inode that we just looked up by filehandle, and see which subvolume it belongs in. So if two subvolumes can point to the same inode, it doesn't work, but if st_dev is different between them, e.g., that'd be fine. Sounds like you're seeing the latter is possible, good!

> 1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.
>
> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
>
> But permissions and everything else are the same.
>
> There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this? Doing a quick test, the inode numbers that a readdir of the parent directory returns *are* distinct. It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint with its own vfsmount, st_dev, etc.

> === How do we want subvolumes to work from a user perspective? ==
>
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see work: currently you can export a subtree of a filesystem if you want, but it's trivial to escape the subtree by guessing filehandles. So this gives us an easy way for administrators to create secure separate exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.

--b.
On Wed, Dec 01, 2010 at 02:44:04PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > Hello,
> >
> > Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.
> >
> > === What are subvolumes? ==
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
> >
> > === What do subvolumes look like? ==
> >
> > All the user sees are directories. They act like any other directory acts, with a few exceptions
> >
> > 1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.
>
> OK, so I'm unclear: would it be possible for nfsd to export subvolumes independently?

Yeah.

> For that to work, we need to be able to take an inode that we just looked up by filehandle, and see which subvolume it belongs in. So if two subvolumes can point to the same inode, it doesn't work, but if st_dev is different between them, e.g., that'd be fine. Sounds like you're seeing the latter is possible, good!

So you can't have the same inode in two subvolumes, since they are different trees. You can have the same inode numbers between two subvolumes, because they are different trees.

> > 1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.
> >
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
> >
> > But permissions and everything else are the same.
> >
> > There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256.
>
> Is that the right way to say this? Doing a quick test, the inode numbers that a readdir of the parent directory returns *are* distinct. It's just the inode number that you get when you stat that is different.
>
> Which is all fine and normal, *if* you treat this as a real mountpoint with its own vfsmount, st_dev, etc.

Oh well crud, I was hoping that I could leave the inode numbers as 256 for everything, but I forgot about readdir. So the inode item in the parent would have to have a unique inode number that would get spit out in readdir, but then if we stat'ed the directory we'd get 256 for the inode number. Oh well, incompat flag it is then.

> > === How do we want subvolumes to work from a user perspective? ==
> >
> > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.
>
> And the separate nfsd exports is another thing I'd really love to see work: currently you can export a subtree of a filesystem if you want, but it's trivial to escape the subtree by guessing filehandles. So this gives us an easy way for administrators to create secure separate exports without having to manage entirely separate volumes.
>
> If subvolumes got real mountpoints and so on, this would be easy.

That's the idea, we'll see how well it works out ;).

Thanks,

Josef
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:> Oh well crud, I was hoping that I could leave the inode numbers as 256 for > everything, but I forgot about readdir. So the inode item in the parent would > have to have a unique inode number that would get spit out in readdir, but then > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > incompat flag it is then.I think you''re already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from readdir, the second from stat.

? --b.

#include <stdio.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>

/* demonstrate that for mountpoints, readdir returns the ino of the
 * mounted-on directory, while stat returns the ino of the mounted
 * directory. */
int main(int argc, char *argv[])
{
	struct dirent *de;
	int ret;
	DIR *d;

	if (argc != 2)
		errx(1, "usage: %s <directory>", argv[0]);
	ret = chdir(argv[1]);
	if (ret)
		err(1, "chdir %s", argv[1]);
	d = opendir(".");
	if (!d)
		err(1, "opendir .");
	while ((de = readdir(d)) != NULL) {
		struct stat st;

		ret = stat(de->d_name, &st);
		if (ret)
			err(1, "stat %s", de->d_name);
		printf("%s %llu %llu\n", de->d_name,
		       (unsigned long long)de->d_ino,
		       (unsigned long long)st.st_ino);
	}
	closedir(d);
	return 0;
}

-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 1 Dec 2010 09:21:36 -0500 Josef Bacik <josef@redhat.com> wrote:> There is one tricky thing. When you create a subvolume, the directory inode > that is created in the parent subvolume has the inode number of 256. So if you > have a bunch of subvolumes in the same parent subvolume, you are going to have a > bunch of directories with the inode number of 256. This is so when users cd > into a subvolume we can know its a subvolume and do all the normal voodoo to > start looking in the subvolumes tree instead of the parent subvolumes tree. > > This is where things go a bit sideways. We had serious problems with NFS, but > thankfully NFS gives us a bunch of hooks to get around these problems. > CIFS/Samba do not, so we will have problems there, not to mention any other > userspace application that looks at inode numbers.A more common use case than CIFS or samba is going to be things like backup programs. They commonly look at inode numbers in order to identify hardlinks and may be horribly confused when there files that have a link count >1 and inode number collisions with other files. That probably qualifies as an "enterprise-ready" show stopper...> === What do we do? ==> > This is where I expect to see the most discussion. Here is what I want to do > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. >Sounds like you''re on the right track. The key concept is really that an inode number should be unique within the scope of the st_dev. The simplest solution for you here is simply to give each subvol its own st_dev and mount it up via a shrinkable mount automagically when someone walks into the directory. In addition to the examples of this in NFS, CIFS does this for DFS referrals. Today, this is mostly done by hijacking the follow_link operation, but David Howells proposed some patches a while back to do this via a more formalized interface. It may be reasonable to target this work on top of that, depending on the state of those changes... -- Jeff Layton <jlayton@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
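To make the point above concrete, here is a minimal sketch of the kind of hardlink detection a backup tool typically does: every copied file is keyed on the (st_dev, st_ino) pair, and a later file with the same key and st_nlink > 1 is treated as another name for data the tool already has. The helper name and the fixed-size table are purely illustrative, not taken from any real backup program.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Illustrative sketch of backup-style hardlink detection.  A file is keyed
 * on (st_dev, st_ino); seeing the same key again with st_nlink > 1 is taken
 * to mean "another name for a file we already copied".  This only works if
 * inode numbers are unique within one st_dev. */
struct seen { dev_t dev; ino_t ino; };
static struct seen table[1024];
static int nseen;

static int seen_before(const char *path)
{
	struct stat st;
	int i;

	if (lstat(path, &st) < 0 || st.st_nlink <= 1)
		return 0;
	for (i = 0; i < nseen; i++)
		if (table[i].dev == st.st_dev && table[i].ino == st.st_ino)
			return 1;
	if (nseen < 1024) {
		table[nseen].dev = st.st_dev;
		table[nseen].ino = st.st_ino;
		nseen++;
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++)
		printf("%s: %s\n", argv[i],
		       seen_before(argv[i]) ? "hardlink of an earlier file" : "new file");
	return 0;
}

If two distinct files in two subvolumes both report inode 256 while sharing one st_dev, the second file is wrongly recorded as a hardlink of the first, which is exactly the confusion described above.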
On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:> On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote: > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for > > everything, but I forgot about readdir. So the inode item in the parent would > > have to have a unique inode number that would get spit out in readdir, but then > > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > > incompat flag it is then. > > I think you''re already fine: > > # mkdir TMP > # dd if=/dev/zero of=TMP-image bs=1M count=512 > # mkfs.btrfs TMP-image > # mount -oloop TMP-image TMP/ > # btrfs subvolume create sub-a > # btrfs subvolume create sub-b > ../readdir-inos . > . 256 256 > .. 256 4130609 > sub-a 256 256 > sub-b 257 256 > > Where readdir-inos is my silly test program below, and the first number is from > readdir, the second from stat. >Heh as soon as I typed my email I went and actually looked at the code, looks like for readdir we fill in the root id, which will be unique, so hotdamn we are good and I don''t have to use a stupid incompat flag. Thanks for checking that :), Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: > > On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote: > > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for > > > everything, but I forgot about readdir. So the inode item in the parent would > > > have to have a unique inode number that would get spit out in readdir, but then > > > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > > > incompat flag it is then. > > > > I think you''re already fine: > > > > # mkdir TMP > > # dd if=/dev/zero of=TMP-image bs=1M count=512 > > # mkfs.btrfs TMP-image > > # mount -oloop TMP-image TMP/ > > # btrfs subvolume create sub-a > > # btrfs subvolume create sub-b > > ../readdir-inos . > > . 256 256 > > .. 256 4130609 > > sub-a 256 256 > > sub-b 257 256 > > > > Where readdir-inos is my silly test program below, and the first number is from > > readdir, the second from stat. > > > > Heh as soon as I typed my email I went and actually looked at the code, looks > like for readdir we fill in the root id, which will be unique, so hotdamn we are > good and I don''t have to use a stupid incompat flag. Thanks for checking that > :),My only complaint was just about how you said this: "When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256" If you revise that you might want to clarify. (Maybe "Every subvolume has a root directory inode with inode number 256"?) The way you''ve stated it sounds like you''re talking about the readdir-returned number, which would normally come from the inode that has been covered up by the mount, and which really is an inode in the parent filesystem.... --b. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:> On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote: >> If you delete your subvolume A, like use the btrfs tool to delete it, you will >> only be stuck with what you changed in snapshot B. So if you only changed 5gig >> worth of information, and you deleted the original subvolume, you would have >> 5gig charged to your quota. > > This doesn''t work, though, if the owners of the "original" and > "new" subvolume are different: > > Case 1: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > * Porthos deletes his copy of the data. > > Case 2: > > * Porthos creates 10G of data. > * Athos makes a snapshot of Porthos''s data. > * Porthos deletes his copy of the data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > > Case 3: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * Aramis makes a snapshot of Porthos''s data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > * Porthos deletes his copy of the data. > > Case 4: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * Aramis makes a snapshot of Athos''s data. > * Porthos deletes his copy of the data. > [Consider also Richelieu changing ownerships of Athos''s and Aramis''s > data at alternative points in this sequence] > > In each of these, who gets charged (and how much) for their copy of > the data? > >> The idea is you are only charged for what blocks >> you have on the disk. Thanks, > > My point was that it''s perfectly possible to have blocks on the > disk that are effectively owned by two people, and that the person to > charge for those blocks is, to me, far from clear. You either end up > charging twice for a single set of blocks on the disk, or you end up > in a situation where one person''s actions can cause another person''s > quota to fill up. Neither of these is particularly obvious behaviour.As a sysadmin and as a user, quotas shouldn''t be about "physical blocks of storage used" but should be about "logical storage used". IOW, if the filesystem is compressed, using 1 GB of physical space to store 10 GB of data, my "quota used" should be 10 GB. Similar for deduplication. The quota is based on the storage *before* the file is deduped. Not after. Similar for snapshots. If UserA has 10 GB of quota used, I snapshot their filesystem, then my "quota used" would be 10 GB as well. As data in my snapshot changes, my "quota used" is updated to reflect that (change 1 GB of data compared to snapshot, use 1 GB of quota). You have to (or at least should) keep two sets of stats for storage usage: - logical amount used ("real" file size, before compression, before de-dupe, before snapshots, etc) - physical amount used (what''s actually written to disk) User-level quotas are based on the logical storage used. Admin-level quotas (if you want to implement them) would be based on physical storage used. Thus, the output of things like df, du, ls would show the "logical" storage used and file sizes. And you would either have an additional option to those apps (--real or something) to show the "actual" storage used and file sizes as stored on disk. Trying to make quotas and disk usage utilities to work based on what''s physically on disk is just backwards, imo. 
And prone to a lot of confusion. -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, 01 December, 2010, Jeff Layton wrote:> A more common use case than CIFS or samba is going to be things like > backup programs. They commonly look at inode numbers in order to > identify hardlinks and may be horribly confused when there files that > have a link count >1 and inode number collisions with other files. > > That probably qualifies as an "enterprise-ready" show stopper...I hope that a backup program, uses the pair (inode,fsid) to identify if two file are hardlinked... otherwise a backup of two filesystem mounted can be quite danguerous... From the statfs(2) man page: [..] The f_fsid field [...] The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the file-system type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the file system when NFS-exported, and giving it out is a security concern. And the btrfs_statfs function returns a different fsid for every subvolume. -- gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it> Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 1 Dec 2010 21:46:03 +0100 Goffredo Baroncelli <kreijack@libero.it> wrote:> On Wednesday, 01 December, 2010, Jeff Layton wrote: > > A more common use case than CIFS or samba is going to be things like > > backup programs. They commonly look at inode numbers in order to > > identify hardlinks and may be horribly confused when there files that > > have a link count >1 and inode number collisions with other files. > > > > That probably qualifies as an "enterprise-ready" show stopper... > > I hope that a backup program, uses the pair (inode,fsid) to identify if two > file are hardlinked... otherwise a backup of two filesystem mounted can be > quite danguerous... > > > From the statfs(2) man page: > [..] > The f_fsid field > [...] > The general idea is that f_fsid contains some random stuff such that the pair > (f_fsid,ino) uniquely determines a file. Some operating systems use (a > variation on) the device number, or the device number combined with the > file-system type. Several OSes restrict giving out the f_fsid field to the > superuser only (and zero it for unprivileged users), because this field is > used in the filehandle of the file system when NFS-exported, and giving it out > is a security concern. > > > And the btrfs_statfs function returns a different fsid for every subvolume. >Ahh, interesting. I''ve never read that blurb on f_fsid... Unfortunately, it looks like not all filesystems fill that field out. NFS and CIFS leave it conspicuously blank. Those are probably bugs... OTOH, the GLibc docs say this: dev_t st_dev Identifies the device containing the file. The st_ino and st_dev, taken together, uniquely identify the file. The st_dev value is not necessarily consistent across reboots or system crashes, however. ...and it''s always been my understanding that a st_dev/st_ino combination should be unique. Is there some definitive POSIX statement on why one should prefer to use f_fsid over st_dev in this situation? -- Jeff Layton <jlayton@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
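For reference, both identifiers being discussed here can be read from userspace as in the sketch below. This is Linux-specific (statfs(2) from sys/vfs.h) and __val is the glibc name for the two opaque words inside fsid_t; as noted above, whether f_fsid is actually populated depends on the filesystem, so this illustrates the two candidate keys rather than recommending either.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
	struct stat st;
	struct statfs stf;
	int i;

	for (i = 1; i < argc; i++) {
		if (stat(argv[i], &st) < 0 || statfs(argv[i], &stf) < 0) {
			perror(argv[i]);
			continue;
		}
		/* (st_dev, st_ino) is the traditional key; (f_fsid, st_ino)
		 * is the pair statfs(2) describes.  Either relies on the
		 * first element being unique per (sub)volume. */
		printf("%s: st_dev=%llx st_ino=%llu f_fsid=%x:%x\n", argv[i],
		       (unsigned long long)st.st_dev,
		       (unsigned long long)st.st_ino,
		       (unsigned)stf.f_fsid.__val[0],
		       (unsigned)stf.f_fsid.__val[1]);
	}
	return 0;
}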
On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: > >> The idea is you are only charged for what blocks > >> you have on the disk. Thanks, > > > > My point was that it''s perfectly possible to have blocks on the > > disk that are effectively owned by two people, and that the person to > > charge for those blocks is, to me, far from clear. You either end up > > charging twice for a single set of blocks on the disk, or you end up > > in a situation where one person''s actions can cause another person''s > > quota to fill up. Neither of these is particularly obvious behaviour. > > As a sysadmin and as a user, quotas shouldn''t be about "physical > blocks of storage used" but should be about "logical storage used". > > IOW, if the filesystem is compressed, using 1 GB of physical space to > store 10 GB of data, my "quota used" should be 10 GB. > > Similar for deduplication. The quota is based on the storage *before* > the file is deduped. Not after. > > Similar for snapshots. If UserA has 10 GB of quota used, I snapshot > their filesystem, then my "quota used" would be 10 GB as well. As > data in my snapshot changes, my "quota used" is updated to reflect > that (change 1 GB of data compared to snapshot, use 1 GB of quota).So if I''ve got 10G of data, and I snapshot it, I''ve just used another 10G of quota?> You have to (or at least should) keep two sets of stats for storage usage: > - logical amount used ("real" file size, before compression, before > de-dupe, before snapshots, etc) > - physical amount used (what''s actually written to disk) > > User-level quotas are based on the logical storage used. > Admin-level quotas (if you want to implement them) would be based on > physical storage used. > > Thus, the output of things like df, du, ls would show the "logical" > storage used and file sizes. And you would either have an additional > option to those apps (--real or something) to show the "actual" > storage used and file sizes as stored on disk. > > Trying to make quotas and disk usage utilities to work based on what''s > physically on disk is just backwards, imo. And prone to a lot of > confusion.Trying to make quotas work based on what''s physically on the disk appears to have serious issues on the semantics of "using up space", so I agree with you on this point (and, indeed, it was the point I was trying to make). However, doing it that way also effectively penalises users and prevents (or severely discourages) them from using the advanced functions of the filesystem. There''s no benefit (in disk usage terms) to the user in using a snapshot -- they might as well use plain cp. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk == PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- I believe that it''s closely correlated with --- the aeroswine coefficient.
On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote: >> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >> >> The idea is you are only charged for what blocks >> >> you have on the disk. Thanks, >> > >> > My point was that it''s perfectly possible to have blocks on the >> > disk that are effectively owned by two people, and that the person to >> > charge for those blocks is, to me, far from clear. You either end up >> > charging twice for a single set of blocks on the disk, or you end up >> > in a situation where one person''s actions can cause another person''s >> > quota to fill up. Neither of these is particularly obvious behaviour. >> >> As a sysadmin and as a user, quotas shouldn''t be about "physical >> blocks of storage used" but should be about "logical storage used". >> >> IOW, if the filesystem is compressed, using 1 GB of physical space to >> store 10 GB of data, my "quota used" should be 10 GB. >> >> Similar for deduplication. The quota is based on the storage *before* >> the file is deduped. Not after. >> >> Similar for snapshots. If UserA has 10 GB of quota used, I snapshot >> their filesystem, then my "quota used" would be 10 GB as well. As >> data in my snapshot changes, my "quota used" is updated to reflect >> that (change 1 GB of data compared to snapshot, use 1 GB of quota). > > So if I''ve got 10G of data, and I snapshot it, I''ve just used > another 10G of quota?Sorry, forgot the "per user" bit above. If UserA has 10 GB of data, then UserB snapshots it, UserB''s quota usage is 10 GB. If UserA has 10 GB of data and snapshots it, then only 10 GB of quota usage is used, as there is 0 difference between the snapshot and the filesystem. As UserA modifies data, their quota usage increases by the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data == 11 GB quota usage). If you combine the two scenarios, you end up with: - UserA has 10 GB of data == 10 GB quota usage - UserB snapshots UserA''s filesystem (clone), so UserB has 10 GB quota usage (even though 0 blocks have changed on disk) - UserA snapshots UserA''s filesystem == no change to quota usage (no blocks on disk have changed) - UserA modifies 1 GB of data in the filesystem == 1 GB new quota usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus the 10 GB in the snapshot) - UserB still only has 10 GB quota usage, since their snapshot hasn''t changed (0 blocks changed) If UserA deletes their filesystem and all their snapshots, freeing up 11 GB of quota usage on their account, UserB''s quota will still be 10 GB, and the blocks on the disk aren''t actually removed (still referenced by UserB''s snapshot). Basically, within a user''s account, only the data unique to a snapshot should count toward the quota. Across accounts, the original (root) snapshot would count completely to the new user''s quota, and then only data unique to subsequent snapshots would count. I hope that makes it more clear. :) All the different layers and whatnot get confusing. :) -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: >> I think you''re already fine: >> >> # mkdir TMP >> # dd if=/dev/zero of=TMP-image bs=1M count=512 >> # mkfs.btrfs TMP-image >> # mount -oloop TMP-image TMP/ >> # btrfs subvolume create sub-a >> # btrfs subvolume create sub-b >> ../readdir-inos . >> . 256 256 >> .. 256 4130609 >> sub-a 256 256 >> sub-b 257 256 >> >> Where readdir-inos is my silly test program below, and the first >> number is from readdir, the second from stat. >> > > Heh as soon as I typed my email I went and actually looked at the > code, looks like for readdir we fill in the root id, which will be > unique, so hotdamn we are good and I don''t have to use a stupid > incompat flag. Thanks for checking that :),Except, aren''t the inode numbers within a filesystem and the sunbvolume tree IDs allocated out of separate namespaces? I don''t think there''s anything preventing a file/directory from having an inode number that clashes with one of the snapshots. In fact, this already happens in the example above: "." (inode 256 in the root subvolume) and "sub-a" (subvolume ID 256). (Though I still don''t understand the semantics well enough to say whether we need all the inode numbers returned by readdir to be distinct.) --Michael Vrable -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 1, 2010 at 3:32 PM, Freddie Cash <fjwcash@gmail.com> wrote:> On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote: >>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >>> >> The idea is you are only charged for what blocks >>> >> you have on the disk. Thanks, >>> > >>> > My point was that it''s perfectly possible to have blocks on the >>> > disk that are effectively owned by two people, and that the person to >>> > charge for those blocks is, to me, far from clear. You either end up >>> > charging twice for a single set of blocks on the disk, or you end up >>> > in a situation where one person''s actions can cause another person''s >>> > quota to fill up. Neither of these is particularly obvious behaviour. >>> >>> As a sysadmin and as a user, quotas shouldn''t be about "physical >>> blocks of storage used" but should be about "logical storage used". >>> >>> IOW, if the filesystem is compressed, using 1 GB of physical space to >>> store 10 GB of data, my "quota used" should be 10 GB. >>> >>> Similar for deduplication. The quota is based on the storage *before* >>> the file is deduped. Not after. >>> >>> Similar for snapshots. If UserA has 10 GB of quota used, I snapshot >>> their filesystem, then my "quota used" would be 10 GB as well. As >>> data in my snapshot changes, my "quota used" is updated to reflect >>> that (change 1 GB of data compared to snapshot, use 1 GB of quota). >> >> So if I''ve got 10G of data, and I snapshot it, I''ve just used >> another 10G of quota? > > Sorry, forgot the "per user" bit above. > > If UserA has 10 GB of data, then UserB snapshots it, UserB''s quota > usage is 10 GB. > > If UserA has 10 GB of data and snapshots it, then only 10 GB of quota > usage is used, as there is 0 difference between the snapshot and the > filesystem. As UserA modifies data, their quota usage increases by > the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data > == 11 GB quota usage). > > If you combine the two scenarios, you end up with: > - UserA has 10 GB of data == 10 GB quota usage > - UserB snapshots UserA''s filesystem (clone), so UserB has 10 GB > quota usage (even though 0 blocks have changed on disk)Please define where the owner of a subvolume/snapshot is stored. To my knowledge when you make a snapshot, you have the same set of files with the same set of owners and groups. Whatever user does the snapshot this does not change this unless chown or chgrp are used. Also a non-root user (or a process without CAP_whatever) should not be able to snapshot a subvolume where the root directory of that subvolume is not owned by the user attempting the snapshot. If you do not do so then you end up with the same security and quota issues that hard links have when you don''t have separate filesystems. You could have separate subvolumes for / and /home/foo and user foo could snapshot / to /home/foo/exploit_later_001 and then foo can just wait for an exploit to come along for one of the binaries or libs in /home/foo/exploit_later_001 and own. Yes, snapshot creation should be more restricted than hard links, for good reason. 
I have other questions but the answer to this fundamental game changer may solve many of the mentioned issues.> - UserA snapshots UserA''s filesystem == no change to quota usage (no > blocks on disk have changed) > - UserA modifies 1 GB of data in the filesystem == 1 GB new quota > usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus > the 10 GB in the snapshot) > - UserB still only has 10 GB quota usage, since their snapshot > hasn''t changed (0 blocks changed) > > If UserA deletes their filesystem and all their snapshots, freeing up > 11 GB of quota usage on their account, UserB''s quota will still be 10 > GB, and the blocks on the disk aren''t actually removed (still > referenced by UserB''s snapshot). > > Basically, within a user''s account, only the data unique to a snapshot > should count toward the quota. > > Across accounts, the original (root) snapshot would count completely > to the new user''s quota, and then only data unique to subsequent > snapshots would count. > > I hope that makes it more clear. :) All the different layers and > whatnot get confusing. :)-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> > This is a huge topic in and of itself, but Christoph mentioned wanting to have > an idea of what we wanted to do with it, so I''m putting it here. There are > really 2 things here > > 1) Limiting the size of subvolumes. This is really easy for us, just create a > subvolume and at creation time set a maximum size it can grow to and not let it > go farther than that. Nice, simple and straightforward. >I''d love to be able to limit the size of a subvolume. Here the size comprises all blocks this subvolume refers to. But at least as important to me is a mode where one can build groups of sub- volumes and snapshots and define a quota for the complete group. Again, the size here comprises all blocks any of the subvolumes/snapshots refer to. If a block is referred to more than once, it counts only once. A subvolume/snapshot can be configured to be part of multiple groups. With this I can do interesting things: a) The user pays only for the space he occupies, not for read-only snapshots b) The user pays for his space and for all the snapshots c) The user pays for his space and snapshots, but not for snapshots generated for internal backup purposes d) Hierarchical quotas. I can limit /home and set an additional quota on each homedir Thanks, Arne -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. >What about the alternative and allocating inode numbers globally? The only problem would be with snapshots as they share the inum with the source, but one could just remap inode numbers in snapshots by sparing some bits at the top of this 64 bit field. Having one mount per subvolume/snapshots is the cleaner solution, but quickly leads to situations where you have _lots_ of mounts, especially when you export them via NFS and mount it somewhere else. I''ve seen a machine which had to handle > 100,000 mounts from a zfs server. This definitely brings it''s own problems, so I''d love to see a full fs exported as a single mount. This will also keep output from tools like iostat (for nfs mounts) and df readable. Thanks, Arne -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Arne Jansen''s message of 2010-12-02 04:49:39 -0500:> Josef Bacik wrote: > > > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > > that way. This unfortunately will be an incompatible format change, but the > > sooner we get this adressed the easier it will be in the long run. Obviously > > when I say format change I mean via the incompat bits we have, so old fs''s won''t > > be broken and such. > > > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > > just do dentry trickery, but that doesn''t make the boundary between subvolumes > > clear, so it will confuse people (and samba) when they walk into a subvolume and > > all of a sudden the inode numbers are the same as in the directory behind them. > > With doing the referral mount thing, each subvolume appears to be its own mount > > and that way things like NFS and samba will work properly. > > > > What about the alternative and allocating inode numbers globally? The only > problem would be with snapshots as they share the inum with the source, but > one could just remap inode numbers in snapshots by sparing some bits at the > top of this 64 bit field.The global inode number is possible, it''s just another btree that must be maintained on disk in order to map which inodes are free and which ones aren''t. It also needs to have a reference count on each inode, since each snapshot effectively increases the reference count on every file and directory it contains. The cost of maintaining that reference count is very very high. -chris> > Having one mount per subvolume/snapshots is the cleaner solution, but > quickly leads to situations where you have _lots_ of mounts, especially when > you export them via NFS and mount it somewhere else. I''ve seen a machine > which had to handle > 100,000 mounts from a zfs server. This definitely > brings it''s own problems, so I''d love to see a full fs exported as a single > mount. This will also keep output from tools like iostat (for nfs mounts) > and df readable. > > Thanks, > Arne-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/12/10 16:11, Chris Mason wrote:> Excerpts from Arne Jansen''s message of 2010-12-02 04:49:39 -0500: > >> Josef Bacik wrote: >> >>> 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode >>> to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic >>> that way. This unfortunately will be an incompatible format change, but the >>> sooner we get this adressed the easier it will be in the long run. Obviously >>> when I say format change I mean via the incompat bits we have, so old fs''s won''t >>> be broken and such. >>> >>> 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we >>> just do dentry trickery, but that doesn''t make the boundary between subvolumes >>> clear, so it will confuse people (and samba) when they walk into a subvolume and >>> all of a sudden the inode numbers are the same as in the directory behind them. >>> With doing the referral mount thing, each subvolume appears to be its own mount >>> and that way things like NFS and samba will work properly. >>> >>> >> What about the alternative and allocating inode numbers globally? The only >> problem would be with snapshots as they share the inum with the source, but >> one could just remap inode numbers in snapshots by sparing some bits at the >> top of this 64 bit field. >> > The global inode number is possible, it''s just another btree that must > be maintained on disk in order to map which inodes are free and which > ones aren''t. It also needs to have a reference count on each inode, > since each snapshot effectively increases the reference count on > every file and directory it contains. > > The cost of maintaining that reference count is very very high. >A couple of years ago I was suffering from the problem of different files having the same inode number on Netapp servers. On a Netapp device if you snapshot a volume then the files in the snapshot have the same inode number as the original, even if the original changes. (Netapp snapshots are read only). This means that if you attempt to see what has changed since your last snapshot using a command line such as: diff src/file.c .snapshots/hourly.12/src.file.c Then the diff tool will tell you that the files are the same even if they are different, because it is assuming that files with the same inode number will have identical contents. Therefore I think it is a bad idea if potentially different files on btrfs can have the same inode number. It will break all sorts of tools. Instead of maintaining a big complicated reference count of used inode numbers, could btrfs use bit masks to create a the userland visible inode number from the subvolume id and the real internal inode number. Something like: userland_inode = ( volume_id << 48 ) & internal_inode; Please forgive me if this is impossible, or if that C snippet is syntactically incorrect. I am not a filesystem or kernel developer, and I have not coded in C for many years. -- David Pottage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
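A small correction to the snippet above: combining the two fields wants a bitwise OR (or addition) rather than AND, since ANDing them would mostly zero the result. A minimal sketch of the packing idea, with the field split (16 bits of subvolume id above 48 bits of per-tree inode number) chosen purely for illustration and not reflecting any actual btrfs limit:

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: expose a per-tree inode number with the subvolume id
 * packed into the top bits, so the user-visible number is unique across
 * the whole filesystem.  The 16/48 split is an assumption for the example. */
static uint64_t userland_inode(uint64_t volume_id, uint64_t internal_inode)
{
	return (volume_id << 48) | (internal_inode & ((1ULL << 48) - 1));
}

int main(void)
{
	/* Two subvolumes can both have an internal inode 256 and still look
	 * different to userspace. */
	printf("%llu\n", (unsigned long long)userland_inode(5, 256));
	printf("%llu\n", (unsigned long long)userland_inode(6, 256));
	return 0;
}

The obvious cost, as the follow-up below also notes, is that the split caps both the number of subvolumes and the per-subvolume inode space.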
On 12/02/2010 04:49 AM, Arne Jansen wrote:> What about the alternative and allocating inode numbers globally? The only > problem would be with snapshots as they share the inum with the source, but > one could just remap inode numbers in snapshots by sparing some bits at the > top of this 64 bit field.I was wondering this as well. Why give each subvol its own inode number space? To avoid breaking assumptions of various programs, if they each have their own inode space, they must each have a unique st_dev. How are inode numbers currently allocated, and why wouldn''t it be simple to just have a single pool of inode numbers for all subvols? It seems obvious to me that snapshots start out inheriting the inode numbers of the original subvol, but must be given a new st_dev. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Josef, > 1) Scrap the 256 inode number thing. Instead we''ll just put a > flag in the inode to say "Hey, I''m a subvolume" and then we can > do all of the appropriate magic that way. This unfortunately > will be an incompatible format change, but the sooner we get this > adressed the easier it will be in the long run. Obviously when I > say format change I mean via the incompat bits we have, so old > fs''s won''t be broken and such. Sorry if I''ve missed this elsewhere in the thread -- will we still have an efficient operation for enumerating subvolumes and snapshots, and how will that work? We''re going to want tools like plymouth and grub to be able to list all snapshots without running a large scan. Thanks, - Chris. -- Chris Ball <cjb@laptop.org> One Laptop Per Child -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2010/12/2 David Pottage <david@electric-spoon.com>:> > Therefore I think it is a bad idea if potentially different files on btrfs > can have the same inode number. It will break all sorts of tools. > > Instead of maintaining a big complicated reference count of used inode > numbers, could btrfs use bit masks to create a the userland visible inode > number from the subvolume id and the real internal inode number. Something > like: > > userland_inode = ( volume_id << 48 ) & internal_inode; > > Please forgive me if this is impossible, or if that C snippet is > syntactically incorrect. I am not a filesystem or kernel developer, and I > have not coded in C for many years. > > -- > David Pottage >Expanding on the idea: what about a pool of IDs for subvolumes and inode numbers inside a subvolume having the subvolume ID as a prefix? It gives each inode a unique number, doesn''t require cheating the userland and is less costly than keeping reference count for each inode. The obvious downside that I can see is limitation on number of subvolumes that it would be possible to create. It also lowers the maximum number of inodes in a filesystem (because of bits taken up by subvolume ID). I expect there are also less-than obvious downsides. Just an idea by a kernel and FS ignorant. -- Paweł Brodacki -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote:> Hi Josef, > > > 1) Scrap the 256 inode number thing. Instead we''ll just put a > > flag in the inode to say "Hey, I''m a subvolume" and then we can > > do all of the appropriate magic that way. This unfortunately > > will be an incompatible format change, but the sooner we get this > > adressed the easier it will be in the long run. Obviously when I > > say format change I mean via the incompat bits we have, so old > > fs''s won''t be broken and such. > > Sorry if I''ve missed this elsewhere in the thread -- will we still > have an efficient operation for enumerating subvolumes and snapshots, > and how will that work? We''re going to want tools like plymouth and > grub to be able to list all snapshots without running a large scan. >Yeah the idea is we want to fix the problems with the design without breaking anything that currently works. So all the changes I want to make are going to be invisible for the user. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 05:52:07PM -0800, Michael Vrable wrote:> On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote: > >On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: > >>I think you''re already fine: > >> > >> # mkdir TMP > >> # dd if=/dev/zero of=TMP-image bs=1M count=512 > >> # mkfs.btrfs TMP-image > >> # mount -oloop TMP-image TMP/ > >> # btrfs subvolume create sub-a > >> # btrfs subvolume create sub-b > >> ../readdir-inos . > >> . 256 256 > >> .. 256 4130609 > >> sub-a 256 256 > >> sub-b 257 256 > >> > >>Where readdir-inos is my silly test program below, and the first > >>number is from readdir, the second from stat. > >> > > > >Heh as soon as I typed my email I went and actually looked at the > >code, looks like for readdir we fill in the root id, which will be > >unique, so hotdamn we are good and I don''t have to use a stupid > >incompat flag. Thanks for checking that :), > > Except, aren''t the inode numbers within a filesystem and the > sunbvolume tree IDs allocated out of separate namespaces? I don''t > think there''s anything preventing a file/directory from having an > inode number that clashes with one of the snapshots. > > In fact, this already happens in the example above: "." (inode 256 > in the root subvolume) and "sub-a" (subvolume ID 256).Oof, yes, I overlooked that.> (Though I still don''t understand the semantics well enough to say > whether we need all the inode numbers returned by readdir to be > distinct.)On normal mounts they''re the number of the inode that was mounted over, so normally they''d be unique across the parent filesystem..... I don''t know if anything depends on that. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 02, 2010 at 05:14:53PM +0000, David Pottage wrote:> A couple of years ago I was suffering from the problem of different > files having the same inode number on Netapp servers. On a Netapp > device if you snapshot a volume then the files in the snapshot have > the same inode number as the original, even if the original changes. > (Netapp snapshots are read only). > > This means that if you attempt to see what has changed since your > last snapshot using a command line such as: > > diff src/file.c .snapshots/hourly.12/src.file.c > > Then the diff tool will tell you that the files are the same even if > they are different, because it is assuming that files with the same > inode number will have identical contents.diff should also recognize when they''re on different filesystem, so this should also be fixable if subvolumes are treated as different filesystem (in the sense that they have different vfsmounts and fsid''s). --b.> > Therefore I think it is a bad idea if potentially different files on > btrfs can have the same inode number. It will break all sorts of > tools. > > Instead of maintaining a big complicated reference count of used > inode numbers, could btrfs use bit masks to create a the userland > visible inode number from the subvolume id and the real internal > inode number. Something like: > > userland_inode = ( volume_id << 48 ) & internal_inode; > > Please forgive me if this is impossible, or if that C snippet is > syntactically incorrect. I am not a filesystem or kernel developer, > and I have not coded in C for many years. > > -- > David Pottage > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:> Hello, > > Various people have complained about how BTRFS deals with subvolumes recently, > specifically the fact that they all have the same inode number, and there''s no > discrete seperation from one subvolume to another. Christoph asked that I lay > out a basic design document of how we want subvolumes to work so we can hash > everything out now, fix what is broken, and then move forward with a design that > everybody is more or less happy with. I apologize in advance for how freaking > long this email is going to be. I assume that most people are generally > familiar with how BTRFS works, so I''m not going to bother explaining in great > detail some stuff. > > === What are subvolumes? ==> > They are just another tree. In BTRFS we have various b-trees to describe the > filesystem. A few of them are filesystem wide, such as the extent tree, chunk > tree, root tree etc. The tree''s that hold the actual filesystem data, that is > inodes and such, are kept in their own b-tree. This is how subvolumes and > snapshots appear on disk, they are simply new b-trees with all of the file data > contained within them. > > === What do subvolumes look like? ==> > All the user sees are directories. They act like any other directory acts, with > a few exceptions > > 1) You cannot hardlink between subvolumes. This is because subvolumes have > their own inode numbers and such, think of them as seperate mounts in this case, > you cannot hardlink between two mounts because the link needs to point to the > same on disk inode, which is impossible between two different filesystems. The > same is true for subvolumes, they have their own trees with their own inodes and > inode numbers, so it''s impossible to hardlink between them. > > 1a) In case it wasn''t clear from above, each subvolume has their own inode > numbers, so you can have the same inode numbers used between two different > subvolumes, since they are two different trees. > > 2) Obviously you can''t just rm -rf subvolumes. Because they are roots there''s > extra metadata to keep track of them, so you have to use one of our ioctls to > delete subvolumes/snapshots. > > But permissions and everything else they are the same. > > There is one tricky thing. When you create a subvolume, the directory inode > that is created in the parent subvolume has the inode number of 256. So if you > have a bunch of subvolumes in the same parent subvolume, you are going to have a > bunch of directories with the inode number of 256. This is so when users cd > into a subvolume we can know its a subvolume and do all the normal voodoo to > start looking in the subvolumes tree instead of the parent subvolumes tree. > > This is where things go a bit sideways. We had serious problems with NFS, but > thankfully NFS gives us a bunch of hooks to get around these problems. > CIFS/Samba do not, so we will have problems there, not to mention any other > userspace application that looks at inode numbers. > > === How do we want subvolumes to work from a user perspective? ==> > 1) Users need to be able to create their own subvolumes. The permission > semantics will be absolutely the same as creating directories, so I don''t think > this is too tricky. We want this because you can only take snapshots of > subvolumes, and so it is important that users be able to create their own > discrete snapshottable targets. > > 2) Users need to be able to snapshot their subvolumes. 
This is basically the > same as #1, but it bears repeating. > > 3) Subvolumes shouldn''t need to be specifically mounted. This is also > important, we don''t want users to have to go around mounting their subvolumes up > manually one-by-one. Today users just cd into subvolumes and it works, just > like cd''ing into a directory. > > === Quotas ==> > This is a huge topic in and of itself, but Christoph mentioned wanting to have > an idea of what we wanted to do with it, so I''m putting it here. There are > really 2 things here > > 1) Limiting the size of subvolumes. This is really easy for us, just create a > subvolume and at creation time set a maximum size it can grow to and not let it > go farther than that. Nice, simple and straightforward. > > 2) Normal quotas, via the quota tools. This just comes down to how do we want > to charge users, do we want to do it per subvolume, or per filesystem. My vote > is per filesystem. Obviously this will make it tricky with snapshots, but I > think if we''re just charging the diff''s between the original volume and the > snapshot to the user then that will be the easiest for people to understand, > rather than making a snapshot all of a sudden count the users currently used > quota * 2. > > === What do we do? ==> > This is where I expect to see the most discussion. Here is what I want to do > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. > > I feel like I''m forgetting something here, hopefully somebody will point it out. > > === Conclusion ==> > There are definitely some wonky things with subvolumes, but I don''t think they > are things that cannot be fixed now. Some of these changes will require > incompat format changes, but it''s either we fix it now, or later on down the > road when BTRFS starts getting used in production really find out how many > things our current scheme breaks and then have to do the changes then. Thanks, >So now that I''ve actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below [root@test1244 btrfs-test]# stat . 
  File: `.''
  Size: 20        Blocks: 8          IO Block: 4096   directory
Device: 15h/21d   Inode: 256         Links: 1
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:41.931679393 -0500
Modify: 2010-12-03 15:35:20.405679493 -0500
Change: 2010-12-03 15:35:20.405679493 -0500

[root@test1244 btrfs-test]# stat foo
  File: `foo''
  Size: 12        Blocks: 0          IO Block: 4096   directory
Device: 19h/25d   Inode: 256         Links: 1
Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:17.501679393 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

[root@test1244 btrfs-test]# stat foo/foobar
  File: `foo/foobar''
  Size: 0         Blocks: 0          IO Block: 4096   regular empty file
Device: 19h/25d   Inode: 257         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:59.150680051 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

So as far as the user is concerned, everything should come out right. Obviously we had to do the NFS trickery still because as far as VFS is concerned the subvolumes are all on the same mount. So the question is this (and really this is directed at Christoph and Bruce and anybody else who may care), is this good enough, or do we want to have a seperate vfsmount for each subvolume? Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:> So now that I''ve actually looked at everything, it looks like the semantics are > all right for subvolumes > > 1) readdir - we return the root id in d_ino, which is unique across the fsThough Michael Vrable pointed out an apparent collision with "normal" inode numbers on the parent filesystem?> 2) stat - we return 256 for all subvolumes, because that is their inode number > 3) dev_t - we setup an anon super for all volumes, so they all get their own > dev_t, which is set properly for all of their children, see below > > [root@test1244 btrfs-test]# stat . > File: `.'' > Size: 20 Blocks: 8 IO Block: 4096 directory > Device: 15h/21d Inode: 256 Links: 1 > Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:41.931679393 -0500 > Modify: 2010-12-03 15:35:20.405679493 -0500 > Change: 2010-12-03 15:35:20.405679493 -0500 > > [root@test1244 btrfs-test]# stat foo > File: `foo'' > Size: 12 Blocks: 0 IO Block: 4096 directory > Device: 19h/25d Inode: 256 Links: 1 > Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:17.501679393 -0500 > Modify: 2010-12-03 15:35:59.150680051 -0500 > Change: 2010-12-03 15:35:59.150680051 -0500 > > [root@test1244 btrfs-test]# stat foo/foobar > File: `foo/foobar'' > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: 19h/25d Inode: 257 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:59.150680051 -0500 > Modify: 2010-12-03 15:35:59.150680051 -0500 > Change: 2010-12-03 15:35:59.150680051 -0500 > > So as far as the user is concerned, everything should come out right. Obviously > we had to do the NFS trickery still because as far as VFS is concerned the > subvolumes are all on the same mount. So the question is this (and really this > is directed at Christoph and Bruce and anybody else who may care), is this good > enough, or do we want to have a seperate vfsmount for each subvolume? Thanks,For nfsd''s purposes, we need to be able find out about filesystems in two different ways: 1. Lookup by filehandle: we need to be able to identify which subvolume we''re dealing with from a filehandle. 2. Lookup by path: we need to notice when we cross into a subvolume. Looks like #1 already works. Not #2: the current nfsd code just checks for mountpoints. We could modify nfsd to also check whether dev_t changed each time it did a lookup. I suppose it would work, though it''s annoying to have to do it just for the case of btrfs. As far as I can tell, crossing into a subvolume is like crossing a mountpoint in every way except for the lack of a separate vfsmount. I''d worry that the inconsistency will end up requiring more special cases down the road, but I don''t have any in mind. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
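The userspace equivalent of the lookup-time check described above is simply comparing st_dev on either side of a directory boundary. The sketch below is illustrative only, with simplified path handling; nfsd would make the corresponding comparison in the kernel on each lookup rather than via stat(2).

#include <stdio.h>
#include <sys/stat.h>

/* Compare the st_dev of a directory and of an entry inside it; a change
 * means a mountpoint or (with per-subvolume anon dev_t) a subvolume
 * boundary was crossed. */
int main(int argc, char *argv[])
{
	struct stat parent, child;
	char path[4096];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dir> <entry>\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &parent) < 0) {
		perror(argv[1]);
		return 1;
	}
	snprintf(path, sizeof(path), "%s/%s", argv[1], argv[2]);
	if (stat(path, &child) < 0) {
		perror(path);
		return 1;
	}
	printf("%s dev=%llx, %s dev=%llx: %s\n",
	       argv[1], (unsigned long long)parent.st_dev,
	       path, (unsigned long long)child.st_dev,
	       parent.st_dev == child.st_dev ?
	       "same filesystem" : "boundary crossed");
	return 0;
}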
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > [original design write-up snipped]
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we setup an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below

A property of NFS filehandles is that they must be stable across server reboots. Is this anon dev_t used as part of the NFS filehandle, and if so how can you guarantee that it is stable?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> > [earlier discussion snipped]
> >
> > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > 3) dev_t - we setup an anon super for all volumes, so they all get their own
> > dev_t, which is set properly for all of their children, see below
>
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It isn't today, that's something we'll have to address.

-chris
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
> Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> > [earlier discussion snipped]
> >
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend on an explicit check for a filesystem type of "btrfs", which is awful--btrfs won't always be the only filesystem that wants to do this kind of thing, etc.)

--b.
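Roughly, the userspace side of that check looks like the sketch below. It is an illustration only, not the actual nfs-utils code: the real mountd compares the mount's filesystem type string from the mount table rather than the statfs magic, and in the C struct the fsid field is spelled f_fsid.

/* Sketch of the idea only, not nfs-utils code: read a per-(sub)volume fsid
 * via statfs() and only trust it when the filesystem really is btrfs. */
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>

#define BTRFS_SUPER_MAGIC 0x9123683E   /* same value as in <linux/magic.h> */

static int get_btrfs_fsid(const char *path, unsigned long long *fsid)
{
        struct statfs st;

        if (statfs(path, &st))
                return -1;
        if (st.f_type != BTRFS_SUPER_MAGIC)
                return -1;      /* only btrfs promises a meaningful f_fsid here */

        memcpy(fsid, &st.f_fsid, sizeof(*fsid));
        return 0;
}

int main(int argc, char **argv)
{
        unsigned long long fsid;

        if (argc > 1 && get_btrfs_fsid(argv[1], &fsid) == 0)
                printf("fsid: %llx\n", fsid);
        return 0;
}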
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.
>
> (That said, since fs_fsid doesn't work for other filesystems, we depend
> on an explicit check for a filesystem type of "btrfs", which is
> awful--btrfs won't always be the only filesystem that wants to do this
> kind of thing, etc.)

Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

Cheers, Andreas
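To make the proposal concrete, here is a sketch of what such a hook might look like. It is purely hypothetical -- export_operations has no ->get_fsid member -- with semantics as implied by the message above:

/* Hypothetical sketch of the proposed hook; there is no ->get_fsid in
 * export_operations today.  The idea: let the filesystem derive a stable
 * fsid from the superblock plus the directory being exported. */
#include <linux/fs.h>

struct export_operations_proposed {
        /* ...the existing export_operations methods would stay as they are... */

        /* Fill @fsid (up to @max_words 32-bit words) for the export rooted
         * at @inode on @sb; return the number of words used, or -errno. */
        int (*get_fsid)(struct super_block *sb, struct inode *inode,
                        u32 *fsid, int max_words);
};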
On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > [original design write-up snipped]
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we setup an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below
>
> [stat output snipped]
>
> So as far as the user is concerned, everything should come out right. Obviously
> we had to do the NFS trickery still because as far as VFS is concerned the
> subvolumes are all on the same mount. So the question is this (and really this
> is directed at Christoph and Bruce and anybody else who may care), is this good
> enough, or do we want to have a separate vfsmount for each subvolume? Thanks,

What are the drawbacks of having a vfsmount for each subvolume?

Why (besides having to code it up) are you trying to avoid doing it that way?
On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> > [original design write-up and stat output snipped]
> >
> > So as far as the user is concerned, everything should come out right. Obviously
> > we had to do the NFS trickery still because as far as VFS is concerned the
> > subvolumes are all on the same mount. So the question is this (and really this
> > is directed at Christoph and Bruce and anybody else who may care), is this good
> > enough, or do we want to have a separate vfsmount for each subvolume? Thanks,
>
> What are the drawbacks of having a vfsmount for each subvolume?
>
> Why (besides having to code it up) are you trying to avoid doing it that way?

It's the having to code it up that way thing, I'm nothing if not lazy.

Josef
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
> >
> > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > on an explicit check for a filesystem type of "btrfs", which is
> > awful--btrfs won't always be the only filesystem that wants to do this
> > kind of thing, etc.)
>
> Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.
>
> Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

No objection from here.

(Though I don't understand the inode argument--aren't "subvolumes" usually expected to have separate superblocks?)

--b.
> === What do subvolumes look like? ==
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> their own inode numbers and such, think of them as separate mounts in this case,
> [...]

which means they act like a different mount point.

> 1a) In case it wasn't clear from above, each subvolume has their own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.

which means they act like not just a different mount point, but they also act like being a separate superblock.

> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.

Again this means they act like a mount point.

> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.

Not that I'm entirely against this, but instead of just stating it as a must, can you also state the detailed reason? Allowing users to create their own subvolumes is a mostly equivalent problem to allowing user mounts, so handling those two under one umbrella makes a lot of sense.

> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, [...]

From reading later posts in this thread, readdir already seems to take care of this in some way. But is there a chance of collisions between the real inode numbers and the ones faked up for the subvolume roots?

> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>
> I feel like I'm forgetting something here, hopefully somebody will point it out.

The current code requires the automount trigger points to be links, which is something that Chris didn't like at all. But that issue is solved by building upon David Howells' series to replace that follow_link magic with a new d_automount dentry operation. I'd suggest building the new code on top of that.

And most importantly:

3) allocate a different anon dev_t for each subvolume.
One thing that really confuses me is that the actual root of the subvolume appears directly in the parent namespace. Given that you have your subvolume identifiers, that doesn't even seem necessary. To me the following scheme seems more useful:

 - all subvolumes/snapshots only show up in a virtual below-root directory, similar to how the existing "default" one doesn't sit on the top.

 - the entries inside a namespace that are to be automounted have an entry in the filesystem that just marks them as an auto-mount point that redirects to the actual subvolume.

 - we still allow mounting subvolumes (and only those) directly from get_sb by specifying the subvolume name.

This is especially important for snapshots, as just having them hang off the filesystem that is to be snapshotted is extremely confusing.
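For reference, the shape of the hook Christoph points at is roughly the following. This is a sketch only: the d_automount signature is the one proposed in David Howells' then-pending series, and btrfs_mount_subvol() is a hypothetical helper, not existing btrfs code.

/* Rough sketch of a subvolume crossing using the d_automount dentry
 * operation from David Howells' series.  btrfs_mount_subvol() is a
 * hypothetical helper that would build (or look up) a vfsmount whose
 * root is the subvolume's own root dentry. */
#include <linux/dcache.h>
#include <linux/mount.h>
#include <linux/path.h>

static struct vfsmount *btrfs_mount_subvol(struct vfsmount *mnt,
                                           struct dentry *dentry); /* hypothetical */

static struct vfsmount *btrfs_d_automount(struct path *path)
{
        /* Hand the VFS a real vfsmount for the subvolume, so NFS, Samba and
         * userspace see a genuine mount boundary instead of dentry trickery. */
        return btrfs_mount_subvol(path->mnt, path->dentry);
}

static const struct dentry_operations btrfs_subvol_dentry_ops = {
        .d_automount = btrfs_d_automount,
};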
On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It's just as stable as a real dev_t is in the times of hotplug and udev. As long as you don't touch anything, including not upgrading the kernel, it will remain stable; otherwise it will break. That's why modern nfs-utils defaults to using the uuid-based filehandle schemes instead of the dev_t based ones.

At least that's what I'm told - I really hope it's using the real UUIDs from the filesystem and not the horrible fsid hack that was once added - for some filesystems like XFS that field does not actually have any relation to the UUID historically. And while we could have changed that, it's too late now that nfs was hacked into abusing that field.
On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.

It's a field that doesn't have any useful specification and basically contains random garbage that a filesystem puts into it. Using it is a very bad idea.
On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It's just as stable as a real dev_t is in the times of hotplug and udev.
> [...] That's why modern nfs-utils defaults to using the uuid-based
> filehandle schemes instead of the dev_t based ones. At least that's
> what I'm told - I really hope it's using the real UUIDs from the
> filesystem and not the horrible fsid hack that was once added [...]

IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but they won't fit into the NFSv2 32-byte filehandles, so there is an '8-byte fsid' and '4-byte fsid + inode number' workaround for that...

See the mk_fsid() helper in fs/nfsd/nfsfh.h

Cheers,

Trond

--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Tue, Dec 07, 2010 at 05:52:13PM +0100, hch wrote:
> On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
>
> It's a field that doesn't have any useful specification and basically
> contains random garbage that a filesystem puts into it. Using it is a
> very bad idea.

I meant the above statement to apply only to btrfs; and nfs-utils is using fs_fsid only in the case where the filesystem type is "btrfs". So I believe the current code does work.

But I agree that constructing filehandles differently based on a strcmp() of the filesystem type is not a sustainable design, to say the least.

--b.
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > Any chance we can add a ->get_fsid(sb, inode) method to
> > export_operations (or something similar), that allows the filesystem to
> > generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.
>
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being exported at the same time, they would need to have different FSID values, hence the inode parameter to allow generating an FSID that is a function of both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas
On 2010-12-07, at 10:02, Trond Myklebust wrote:
> On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> > [...]
>
> IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> they won't fit into the NFSv2 32-byte filehandles, so there is an
> '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
>
> See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Cheers, Andreas
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> On 2010-12-07, at 10:02, Trond Myklebust wrote:
> > [...]
>
> It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid by default (in utils/mountd/cache.c:uuid_by_path()).

> There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.
On 2010-12-08, at 10:27, J. Bruce Fields wrote:
> On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> > It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).
>
> No, if you look at the nfs-utils source you'll find mountd sets a uuid
> by default (in utils/mountd/cache.c:uuid_by_path()).

Unfortunately, this only works for block devices, not network filesystems.

> > There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.
>
> Agreed that doing this in the kernel would probably be simpler.

Agreed.

Cheers, Andreas
On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <bfields@redhat.com> wrote:
> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > > We're using statfs64.fs_fsid for this; I believe that's both stable
> > > across reboots and distinguishes between subvolumes, so that's OK.
> > >
> > > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > > on an explicit check for a filesystem type of "btrfs", which is
> > > awful--btrfs won't always be the only filesystem that wants to do this
> > > kind of thing, etc.)
> >
> > Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> > (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.

My standard objection here is that you cannot guarantee that the fsid is unique across all filesystems in the system (including filesystems mounted from dm snapshots of filesystems that are currently mounted). NFSd needs this uniqueness.

This is only really an objection if user-space cannot override the fsid provided by the filesystem.

I'd be very happy to see an interface to user-space whereby user-space can get a reasonably unique fsid for a given filesystem. Whether this is an export_operations method or some field in 'struct super' which gets copied out doesn't matter to me.

NeilBrown
On 2010-12-08, at 16:07, Neil Brown wrote:
> On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <bfields@redhat.com> wrote:
> > On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > > Any chance we can add a ->get_fsid(sb, inode) method to
> > > export_operations (or something similar), that allows the
> > > filesystem to generate an FSID based on the volume and
> > > inode that is being exported?
> >
> > No objection from here.
>
> My standard objection here is that you cannot guarantee that the
> fsid is unique across all filesystems in the system (including
> filesystems mounted from dm snapshots of filesystems that are
> currently mounted). NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value...

> This is only really an objection if user-space cannot override
> the fsid provided by the filesystem.

Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding "btrfs" filesystems).

> I'd be very happy to see an interface to user-space whereby
> user-space can get a reasonably unique fsid for a given
> filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.

> Whether this is an export_operations method or some field in
> 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas
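The gist of the "Copy fs UUID to superblock" patches is tiny. Roughly -- as a sketch of the idea described above, not the literal patch text, and assuming the s_uuid field those patches add to struct super_block -- at mount time the filesystem mirrors its on-disk UUID into the generic superblock so generic code can use it:

/* Sketch of the idea behind the "Copy fs UUID to superblock" patches, not
 * the literal patch: at mount time, copy the filesystem's on-disk UUID
 * into the generic super_block so generic code (nfsd, mountinfo) can see
 * it without filesystem-specific knowledge. */
#include <linux/fs.h>
#include <linux/string.h>

static void example_copy_fs_uuid(struct super_block *sb,
                                 const unsigned char *ondisk_uuid)
{
        /* s_uuid is the 16-byte field added by the patchset referenced above */
        memcpy(sb->s_uuid, ondisk_uuid, sizeof(sb->s_uuid));
}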
On Wed, Dec 08, 2010 at 09:41:33PM -0700, Andreas Dilger wrote:
> On 2010-12-08, at 16:07, Neil Brown wrote:
> > [...]
> >
> > I'd be very happy to see an interface to user-space whereby
> > user-space can get a reasonably unique fsid for a given
> > filesystem.
>
> Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

Well, the kernel can't distinguish the case of an explicit "uuid=" option in /etc/exports from one that was (as is the normal default) generated automatically by mountd. Maybe not a big deal.

The uuid seems like a useful thing to have access to from userspace anyway, for userspace nfs servers if for no other reason:

> That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in
> sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.
>
> Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):
>
> [PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
> [PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
> [PATCH -V22 14/14] ext4: Copy fs UUID to superblock

I can't see anything wrong with that.

--b.
On Wednesday 01 December 2010, Mike Hommey wrote:
> On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > > > [original design write-up snipped]
> > >
> > > could it be possible to convert a directory into a volume? or at
> > > least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new
> > volume and copying all the files/dirs in by hand.
>
> Except you wouldn't have to copy data, only metadata.

And it could probably be race-free. If I cp --reflink or rsync stuff from a real directory to a subvolume and then rename the old directory to another name and the subvolume to the directory name, then I might be missing files that have been created during the copy process and missing changes to files that have already been copied.

What I would like is an easy way to make ~/.kde or whatever a subvolume, to be able to snapshot it independently while KDE applications or whatever are using and writing to it, *without* any userland even noticing it and without any additional space consumption (except for the metadata needed to manage the subvolume).

So

deepdance:/#12> btrfs subvolume create /home/martin/.kde
ERROR: '/home/martin/.kde' exists

would just make a subvolume out of ~/.kde, even if it needs splitting out the tree or even copying the tree data into a new tree. There are other filesystem operations, like btrfs filesystem balance, that can be expensive as well.

All that said from a user point of view. Maybe technically it's not feasible. But it would be nice if it can be made feasible without losing existing advantages.

And maybe

deepdance:/> btrfs subvolume create .
ERROR: '.' exists

should really remain this way ;).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Thu, 2010-12-02 at 10:49 +0100, Arne Jansen wrote:
> Josef Bacik wrote:
> > [proposals 1) and 2) from the design write-up snipped]
>
> What about the alternative and allocating inode numbers globally? The only
> problem would be with snapshots as they share the inum with the source, but
> one could just remap inode numbers in snapshots by sparing some bits at the
> top of this 64 bit field.
>
> Having one mount per subvolume/snapshot is the cleaner solution, but
> quickly leads to situations where you have _lots_ of mounts, especially when
> you export them via NFS and mount it somewhere else. I've seen a machine
> which had to handle > 100,000 mounts from a zfs server. This definitely
> brings its own problems, so I'd love to see a full fs exported as a single
> mount. This will also keep output from tools like iostat (for nfs mounts)
> and df readable.

Having a lot of mounts will be a problem when the mount table is exposed directly from the kernel, something that must be done, and is being done in the latest util-linux.

Ian
On Mon, 2010-12-06 at 09:27 -0500, Josef Bacik wrote:
> On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> > On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> > > [original design write-up and stat output snipped]
> > >
> > > So as far as the user is concerned, everything should come out right. Obviously
> > > we had to do the NFS trickery still because as far as VFS is concerned the
> > > subvolumes are all on the same mount. So the question is this (and really this
> > > is directed at Christoph and Bruce and anybody else who may care), is this good
> > > enough, or do we want to have a separate vfsmount for each subvolume? Thanks,
> >
> > What are the drawbacks of having a vfsmount for each subvolume?
> >
> > Why (besides having to code it up) are you trying to avoid doing it that way?
>
> It's the having to code it up that way thing, I'm nothing if not lazy.

And, anything that uses the mount table, exposed from the kernel, will grind a system to a halt with only a few thousand mounts, not to mention that user space utilities, like df, du ..., will become painful to use for more than a hundred or so entries.