Hello,

Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.

=== What are subvolumes? ==

They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.

=== What do subvolumes look like? ==

All the user sees are directories. They act like any other directory acts, with a few exceptions

1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.

1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.

2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.

But permissions and everything else are the same.

There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.

This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.

=== How do we want subvolumes to work from a user perspective? ==

1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.

2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.

3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

=== Quotas ==

This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here

1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.

2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.

=== What do we do? ==

This is where I expect to see the most discussion. Here is what I want to do

1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic that way. This unfortunately will be an incompatible format change, but the sooner we get this addressed the easier it will be in the long run. Obviously when I say format change I mean via the incompat bits we have, so old fs's won't be broken and such.

2) Do something like NFS's referral mounts when we cd into a subvolume. Now we just do dentry trickery, but that doesn't make the boundary between subvolumes clear, so it will confuse people (and samba) when they walk into a subvolume and all of a sudden the inode numbers are the same as in the directory behind them. By doing the referral mount thing, each subvolume appears to be its own mount and that way things like NFS and samba will work properly.

I feel like I'm forgetting something here; hopefully somebody will point it out.

=== Conclusion ==

There are definitely some wonky things with subvolumes, but I don't think they are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road, when BTRFS starts getting used in production, we really find out how many things our current scheme breaks and then have to do the changes then.

Thanks,

Josef
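[To make the duplicate-inode problem above concrete, here is a minimal sketch of what a userspace tool sees today. The mount point /mnt/btrfs and subvolume name sub-a are made up for the example; the behaviour it checks for (unchanged st_dev, shared inode number 256) is the one described in the email.]

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat top, sub;

	/* Hypothetical layout: /mnt/btrfs is the filesystem, sub-a a subvolume in it. */
	if (stat("/mnt/btrfs", &top) != 0 || stat("/mnt/btrfs/sub-a", &sub) != 0) {
		perror("stat");
		return 1;
	}

	/* Crossing into the subvolume does not change st_dev today... */
	if (top.st_dev == sub.st_dev)
		printf("same st_dev: userspace sees one filesystem\n");

	/* ...so the shared inode number is the only hint of a subvolume boundary. */
	if (sub.st_ino == 256)
		printf("st_ino == 256: sub-a looks like a subvolume root\n");

	return 0;
}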
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

It would be helpful to be able to create subvolumes off existing directories, instead of creating a subvolume and having to copy all the data around.

Mike
On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
>
> === How do we want subvolumes to work from a user perspective? ==
>
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.

could it be possible to convert a directory into a volume? or at least base a snapshot off it?

C Anthony
Excerpts from Josef Bacik's message of 2010-12-01 09:21:36 -0500:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.

Thanks for writing this up.

> === What do we do? ==
>
> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic that way. This unfortunately will be an incompatible format change, but the sooner we get this addressed the easier it will be in the long run. Obviously when I say format change I mean via the incompat bits we have, so old fs's won't be broken and such.

If they don't have inode number 256, what inode number do they have?

I'm assuming you mean the subvolume is given an inode number in the parent directory just like any other dir, but this doesn't get rid of the duplicate inode problem. I think it ends up making it less clear, but I'm open to suggestions ;)

We could give each subvol a different devt, which is something Christoph had asked about as well.

-chris
Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> >
> > === How do we want subvolumes to work from a user perspective? ==
> >
> > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> could it be possible to convert a directory into a volume? or at least base a snapshot off it?

I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

-chris
On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ==
> > >
> > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

ok; if i create an empty volume, and use cp --reflink, it would have the desired effect though, right?

C Anthony
Excerpts from C Anthony Risinger's message of 2010-12-01 11:03:23 -0500:
> On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <chris.mason@oracle.com> wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > > >
> > > > === How do we want subvolumes to work from a user perspective? ==
> > > >
> > > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > > >
> > > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> > >
> > > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.
>
> ok; if i create an empty volume, and use cp --reflink, it would have the desired effect though, right?

Almost, for no good reason at all our cp --reflink doesn't reflink across subvols. I'll get that fixed up.

-chris
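[For reference, the reflink copy being discussed comes down to the btrfs clone ioctl; a rough per-file sketch of what cp --reflink does is below. The ioctl number is defined locally to keep the example self-contained and is intended to mirror the btrfs ioctl header; the file names are made up and error handling is minimal, so treat this as an illustration rather than the actual cp implementation.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors the definition in the btrfs ioctl header. */
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE _IOW(BTRFS_IOCTL_MAGIC, 9, int)

int main(void)
{
	/* Hypothetical paths on the same btrfs filesystem. */
	int src = open("data/file", O_RDONLY);
	int dst = open("new-subvol/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Share the source's extents with the destination instead of copying data
	 * (this is the operation that, at the time of this thread, was refused
	 * when source and destination were in different subvolumes). */
	if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
		perror("BTRFS_IOC_CLONE");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}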
On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ==
> > >
> > > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > could it be possible to convert a directory into a volume? or at least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume and copying all the files/dirs in by hand.

Except you wouldn't have to copy data, only metadata.

Mike
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> === Quotas ==
>
> This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
>
> 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
>
> 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.

This is going to be tricky to get the semantics right, I suspect.

Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).

If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?

This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

Hugo.
Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

I would argue that a simple and probably correct solution is to have the files count toward the quota of everyone who has a COW copy. i.e. if I have a volume A and you make a snapshot B, the du content of B should count toward your quota as well, rather than being "free". I don't see any reason why this would not be the correct and intuitive way to do it.

Simply treat it as you would transparent block-level deduplication.

Gordan
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

In your case, it would sound fair that everyone is "simply" charged 10G.

What Josef is referring to would probably only apply to volumes and snapshots owned by the same user: If I have a subvolume of 10G, and a snapshot of it where I only changed 1G, the charged quota would be 11G, not 20G.

Mike
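[A quick worked version of that arithmetic, under the same-owner assumption Mike describes; the 10G/1G numbers and the two accounting policies are the ones from this thread, nothing more.]

#include <stdio.h>

int main(void)
{
	unsigned long long gib = 1024ULL * 1024 * 1024;
	unsigned long long original = 10 * gib;  /* blocks in subvolume A                */
	unsigned long long changed  = 1 * gib;   /* blocks rewritten in the snapshot B   */

	/* "Charge the diff": shared blocks counted once, plus what the snapshot rewrote. */
	unsigned long long charge_diff = original + changed;

	/* Naive accounting: count every block each subvolume references ("quota * 2"). */
	unsigned long long charge_naive = 2 * original;

	printf("diff accounting:  %llu GiB\n", charge_diff / gib);   /* 11 GiB */
	printf("naive accounting: %llu GiB\n", charge_naive / gib);  /* 20 GiB */
	return 0;
}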
On Wed, Dec 1, 2010 at 10:38 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I don't think it's going to be possible to implement quotas, even on a filesystem level, until there's some good and functional model for dealing with all the implications of COW copies. :(

i'd expect that as a separate user, you should both be whacked 10G. imo, the whole benefit of transparent COW is to the administrator's advantage, thus i would even think the _uncompressed_ volume size would go against quota (which could possibly be artificially inflated to account for the space saving of compression). users just need a nice steadily predictable number to monitor.

though maybe these users could be grouped, such that the COW'ed portions of the files they share are balanced across each user's quota, but this would have to be a sort of "opt in" thing, else you get wild fluctuations because of other users' actions.

additionally, some users could be marked as "system", where COW'ing their subvol results in 0 quota -- you only pay for what you change -- but if the system subvol gets removed, then you pay for it all. in this way you would have to keep reusing system subvols to get any advantage as a regular user.

i don't know the existing systems though, so i don't know what it would take to do such balancing.

C Anthony
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ==
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?

If you delete your subvolume A, like use the btrfs tool to delete it, you will only be stuck with what you changed in snapshot B. So if you only changed 5gig worth of information, and you deleted the original subvolume, you would have 5gig charged to your quota. The idea is you are only charged for what blocks you have on the disk.

Thanks,

Josef
On Wednesday, 01 December, 2010, Josef Bacik wrote:
> Hello,

Hi Josef

> === What are subvolumes? ==
>
> They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
>
> === What do subvolumes look like? ==

[...]

> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.

Sorry, but I can't understand this sentence. It is clear that a directory and a subvolume have a totally different on-disk format. But why would it not be possible to remove a subvolume via the normal rmdir(2) syscall? I posted a patch some months ago: when rmdir is invoked on a subvolume, the same action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.

See https://patchwork.kernel.org/patch/260301/

[...]

> There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.
>
> This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.

How is this (or how should this be) different from a mounted filesystem? For example:

# cd /tmp
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# mkdir mount-a; mkdir mount-b
# mount /dev/sda6 mount-a    # an ext4 fs
# mount /dev/sdb2 mount-b    # an ext3 fs
# stat -c "%8i %n" sub-a sub-b mount-a mount-b
     256 sub-a
     256 sub-b
       2 mount-a
       2 mount-b

In this case the inode numbers returned are equal for both the mounted filesystems and the subvolumes. However, the fsid is different.

# stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
cdc937c1a203df74 sub-a
cdc937c1a203df77 sub-b
b27d147f003561c8 mount-a
d49e1a3d2333d2e1 mount-b
cdc937c1a203df75 .

Moreover I suggest looking at the difference between the inode returned by readdir(3) and the one returned by stat(3).

[...]

> I feel like I'm forgetting something here; hopefully somebody will point it out.

Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".

A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...

I am open to other opinions.

> === Conclusion ==
>
> There are definitely some wonky things with subvolumes, but I don't think they are things that cannot be fixed now. Some of these changes will require incompat format changes, but it's either we fix it now, or later on down the road, when BTRFS starts getting used in production, we really find out how many things our current scheme breaks and then have to do the changes then.
>
> Thanks,
>
> Josef

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
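[For context, the ioctl path that the rmdir patch above maps onto looks roughly like this from userspace. The ioctl number and argument layout are defined locally and are intended to mirror the btrfs ioctl header; the mount point and subvolume name are invented for the example.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_PATH_NAME_MAX 4087

/* Layout mirrors struct btrfs_ioctl_vol_args in the btrfs ioctl header. */
struct btrfs_ioctl_vol_args {
	long long fd;
	char name[BTRFS_PATH_NAME_MAX + 1];
};

#define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, struct btrfs_ioctl_vol_args)

int main(void)
{
	struct btrfs_ioctl_vol_args args;
	/* The ioctl is issued on the parent directory, naming the subvolume to remove. */
	int parent = open("/mnt/btrfs", O_RDONLY);

	if (parent < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	strncpy(args.name, "sub-a", sizeof(args.name) - 1);

	/* Roughly what "btrfs subvolume delete /mnt/btrfs/sub-a" does;
	 * the rmdir patch would trigger the same internal path from rmdir(2). */
	if (ioctl(parent, BTRFS_IOC_SNAP_DESTROY, &args) < 0) {
		perror("BTRFS_IOC_SNAP_DESTROY");
		return 1;
	}

	close(parent);
	return 0;
}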
On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
> On Wednesday, 01 December, 2010, Josef Bacik wrote:
> > Hello,
>
> Hi Josef
>
> > === What are subvolumes? ==
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
> >
> > === What do subvolumes look like? ==
>
> [...]
>
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
>
> Sorry, but I can't understand this sentence. It is clear that a directory and a subvolume have a totally different on-disk format. But why would it not be possible to remove a subvolume via the normal rmdir(2) syscall? I posted a patch some months ago: when rmdir is invoked on a subvolume, the same action as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.
>
> See https://patchwork.kernel.org/patch/260301/

Oh hey, that's cool. That would be reasonable I think. I was just saying that currently we can't remove subvolumes/snapshots via rm, not that it wasn't possible at all. So I think what you did would be a good thing to have.

> [...]
>
> > There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256. So if you have a bunch of subvolumes in the same parent subvolume, you are going to have a bunch of directories with the inode number of 256. This is so when users cd into a subvolume we can know it's a subvolume and do all the normal voodoo to start looking in the subvolume's tree instead of the parent subvolume's tree.
> >
> > This is where things go a bit sideways. We had serious problems with NFS, but thankfully NFS gives us a bunch of hooks to get around these problems. CIFS/Samba do not, so we will have problems there, not to mention any other userspace application that looks at inode numbers.
>
> How is this (or how should this be) different from a mounted filesystem? For example:
>
> # cd /tmp
> # btrfs subvolume create sub-a
> # btrfs subvolume create sub-b
> # mkdir mount-a; mkdir mount-b
> # mount /dev/sda6 mount-a    # an ext4 fs
> # mount /dev/sdb2 mount-b    # an ext3 fs
> # stat -c "%8i %n" sub-a sub-b mount-a mount-b
>      256 sub-a
>      256 sub-b
>        2 mount-a
>        2 mount-b
>
> In this case the inode numbers returned are equal for both the mounted filesystems and the subvolumes. However, the fsid is different.
>
> # stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
> cdc937c1a203df74 sub-a
> cdc937c1a203df77 sub-b
> b27d147f003561c8 mount-a
> d49e1a3d2333d2e1 mount-b
> cdc937c1a203df75 .
>
> Moreover I suggest looking at the difference between the inode returned by readdir(3) and the one returned by stat(3).

Yeah you are right, the inode numbering can probably be the same, we just need to make them logically different mounts so things like NFS and samba still work right.

> [...]
> > I feel like I'm forgetting something here; hopefully somebody will point it out.
>
> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>
> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>
> I am open to other opinions.

Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.

Thanks,

Josef
On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <josef@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>
>> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>>
>> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>>
>> I am open to other opinions.
>
> Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.

i would really like to see a solution to this particular issue. i may be missing something, but the dangling subvol roots doesn't seem to address the management of the root volume itself.

for example... most people will install their whole system into the real root (id=5), but this renders the system unmanageable, because there is no way to ever empty it without manually issuing an `rm -rf`.

i'm having a really hard time controlling this with the initramfs hook i provide for archlinux users. the hook requires a specific structure "underneath" what the user perceives as /, but i can only accomplish this for new installs -- for existing installs i can set up the proper "subroot" structure, and snapshot their current root... but i cannot remove the stagnant files in the real root (id=5) that will never, ever be accessed again.

... or does dangling roots address this?

C Anthony
On Wed, Dec 1, 2010 at 12:48 PM, C Anthony Risinger <anthony@extof.me> wrote:
> On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <josef@redhat.com> wrote:
>> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>>
>>> Another point that I would like to discuss is how to manage the "pivoting" between the subvolumes. One of the most beautiful features of btrfs is the snapshot capability. In fact it is possible to make a snapshot of the root of the filesystem and to mount it in a subsequent reboot. But it is very complicated to manage the pivoting of a snapshot of a root filesystem, because I cannot delete the "old root" due to the fact that the "new root" is placed in the "old root".
>>>
>>> A possible solution is not to put the root of the filesystem (where /usr, /etc ... are placed) in the root of the btrfs filesystem; instead it should be accepted from the beginning that the root of a filesystem should be placed in a subvolume, which in turn is placed in the root of a btrfs filesystem...
>>>
>>> I am open to other opinions.
>>
>> Agreed, one of the things that Chris and I have discussed is the possibility of just having dangling roots, since really the directories are just an easy way to get to the subvolumes. This would let you delete the original volume and use the snapshot from then on out. Something to do in the future for sure.
>
> i would really like to see a solution to this particular issue. i may be missing something, but the dangling subvol roots doesn't seem to address the management of the root volume itself.
>
> for example... most people will install their whole system into the real root (id=5), but this renders the system unmanageable, because there is no way to ever empty it without manually issuing an `rm -rf`.
>
> i'm having a really hard time controlling this with the initramfs hook i provide for archlinux users. the hook requires a specific structure "underneath" what the user perceives as /, but i can only accomplish this for new installs -- for existing installs i can set up the proper "subroot" structure, and snapshot their current root... but i cannot remove the stagnant files in the real root (id=5) that will never, ever be accessed again.
>
> ... or does dangling roots address this?

i forgot to mention, but a quick 'n dirty solution would be to simply not enable users to do this by accident. mkfs.btrfs could create a new subvol, then mark it as default... this way the user has to manually mount with id=0, or remark 0 as the default.

effectively, users would unknowingly be installing into a subvolume, rather than the top-level root (apologies if my terminology is incorrect).

C Anthony
On Wednesday, 01 December, 2010, you (C Anthony Risinger) wrote:
[...]
> i forgot to mention, but a quick 'n dirty solution would be to simply not enable users to do this by accident. mkfs.btrfs could create a new subvol, then mark it as default... this way the user has to manually mount with id=0, or remark 0 as the default.
>
> effectively, users would unknowingly be installing into a subvolume, rather than the top-level root (apologies if my terminology is incorrect).

I fully agree: it fulfills the KISS principle :-)

> C Anthony

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > === Quotas ==
> > >
> > > This is a huge topic in and of itself, but Christoph mentioned wanting to have an idea of what we wanted to do with it, so I'm putting it here. There are really 2 things here
> > >
> > > 1) Limiting the size of subvolumes. This is really easy for us, just create a subvolume and at creation time set a maximum size it can grow to and not let it go farther than that. Nice, simple and straightforward.
> > >
> > > 2) Normal quotas, via the quota tools. This just comes down to how do we want to charge users, do we want to do it per subvolume, or per filesystem. My vote is per filesystem. Obviously this will make it tricky with snapshots, but I think if we're just charging the diffs between the original volume and the snapshot to the user then that will be the easiest for people to understand, rather than making a snapshot all of a sudden count the user's currently used quota * 2.
> >
> > This is going to be tricky to get the semantics right, I suspect.
> >
> > Say you've created a subvolume, A, containing 10G of Useful Stuff (say, a base image for VMs). This counts 10G against your quota. Now, I come along and snapshot that subvolume (as a writable subvolume) -- call it B. This is essentially free for me, because I've got a COW copy of your subvolume (and the original counts against your quota).
> >
> > If I now modify a file in subvolume B, the full modified section goes onto my quota. This is all well and good. But what happens if you delete your subvolume, A? Suddenly, I get lumbered with 10G of extra files. Worse, what happens if someone else had made a snapshot of A, too? Who gets the 10G added to their quota, me or them? What if I'd filled up my quota? Would that stop you from deleting your copy, because my copy can't be charged against my quota? Would I just end up unexpectedly 10G over quota?
>
> If you delete your subvolume A, like use the btrfs tool to delete it, you will only be stuck with what you changed in snapshot B. So if you only changed 5gig worth of information, and you deleted the original subvolume, you would have 5gig charged to your quota.

This doesn't work, though, if the owners of the "original" and "new" subvolume are different:

Case 1:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 2:
 * Porthos creates 10G of data.
 * Athos makes a snapshot of Porthos's data.
 * Porthos deletes his copy of the data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.

Case 3:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Porthos's data.
 * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of Porthos's data to Athos.
 * Porthos deletes his copy of the data.

Case 4:
 * Porthos creates 10G data.
 * Athos makes a snapshot of Porthos's data.
 * Aramis makes a snapshot of Athos's data.
 * Porthos deletes his copy of the data.

[Consider also Richelieu changing ownerships of Athos's and Aramis's data at alternative points in this sequence]

In each of these, who gets charged (and how much) for their copy of the data?

> The idea is you are only charged for what blocks you have on the disk.

My point was that it's perfectly possible to have blocks on the disk that are effectively owned by two people, and that the person to charge for those blocks is, to me, far from clear. You either end up charging twice for a single set of blocks on the disk, or you end up in a situation where one person's actions can cause another person's quota to fill up. Neither of these is particularly obvious behaviour.

Hugo.
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.
>
> === What are subvolumes? ==
>
> They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
>
> === What do subvolumes look like? ==
>
> All the user sees are directories. They act like any other directory acts, with a few exceptions
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes independently?

For that to work, we need to be able to take an inode that we just looked up by filehandle, and see which subvolume it belongs in. So if two subvolumes can point to the same inode, it doesn't work, but if st_dev is different between them, e.g., that'd be fine. Sounds like you're seeing the latter is possible, good!

> 1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.
>
> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
>
> But permissions and everything else are the same.
>
> There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this? Doing a quick test, the inode numbers that a readdir of the parent directory returns *are* distinct. It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint with its own vfsmount, st_dev, etc.

> === How do we want subvolumes to work from a user perspective? ==
>
> 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see work: currently you can export a subtree of a filesystem if you want, but it's trivial to escape the subtree by guessing filehandles. So this gives us an easy way for administrators to create secure separate exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.

--b.
On Wed, Dec 01, 2010 at 02:44:04PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > Hello,
> >
> > Various people have complained about how BTRFS deals with subvolumes recently, specifically the fact that they all have the same inode number, and there's no discrete separation from one subvolume to another. Christoph asked that I lay out a basic design document of how we want subvolumes to work so we can hash everything out now, fix what is broken, and then move forward with a design that everybody is more or less happy with. I apologize in advance for how freaking long this email is going to be. I assume that most people are generally familiar with how BTRFS works, so I'm not going to bother explaining some stuff in great detail.
> >
> > === What are subvolumes? ==
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the filesystem. A few of them are filesystem wide, such as the extent tree, chunk tree, root tree etc. The trees that hold the actual filesystem data, that is inodes and such, are kept in their own b-tree. This is how subvolumes and snapshots appear on disk: they are simply new b-trees with all of the file data contained within them.
> >
> > === What do subvolumes look like? ==
> >
> > All the user sees are directories. They act like any other directory acts, with a few exceptions
> >
> > 1) You cannot hardlink between subvolumes. This is because subvolumes have their own inode numbers and such; think of them as separate mounts in this case: you cannot hardlink between two mounts because the link needs to point to the same on-disk inode, which is impossible between two different filesystems. The same is true for subvolumes: they have their own trees with their own inodes and inode numbers, so it's impossible to hardlink between them.
>
> OK, so I'm unclear: would it be possible for nfsd to export subvolumes independently?

Yeah.

> For that to work, we need to be able to take an inode that we just looked up by filehandle, and see which subvolume it belongs in. So if two subvolumes can point to the same inode, it doesn't work, but if st_dev is different between them, e.g., that'd be fine. Sounds like you're seeing the latter is possible, good!

So you can't have the same inode in two subvolumes, since they are different trees. You can have the same inode numbers between two subvolumes, because they are different trees.

> > 1a) In case it wasn't clear from above, each subvolume has its own inode numbers, so you can have the same inode numbers used between two different subvolumes, since they are two different trees.
> >
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's extra metadata to keep track of them, so you have to use one of our ioctls to delete subvolumes/snapshots.
> >
> > But permissions and everything else are the same.
> >
> > There is one tricky thing. When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256.
>
> Is that the right way to say this? Doing a quick test, the inode numbers that a readdir of the parent directory returns *are* distinct. It's just the inode number that you get when you stat that is different.
>
> Which is all fine and normal, *if* you treat this as a real mountpoint with its own vfsmount, st_dev, etc.

Oh well crud, I was hoping that I could leave the inode numbers as 256 for everything, but I forgot about readdir. So the inode item in the parent would have to have a unique inode number that would get spit out in readdir, but then if we stat'ed the directory we'd get 256 for the inode number. Oh well, incompat flag it is then.

> > === How do we want subvolumes to work from a user perspective? ==
> >
> > 1) Users need to be able to create their own subvolumes. The permission semantics will be absolutely the same as creating directories, so I don't think this is too tricky. We want this because you can only take snapshots of subvolumes, and so it is important that users be able to create their own discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes. This is basically the same as #1, but it bears repeating.
> >
> > 3) Subvolumes shouldn't need to be specifically mounted. This is also important, we don't want users to have to go around mounting their subvolumes up manually one-by-one. Today users just cd into subvolumes and it works, just like cd'ing into a directory.
>
> And the separate nfsd exports is another thing I'd really love to see work: currently you can export a subtree of a filesystem if you want, but it's trivial to escape the subtree by guessing filehandles. So this gives us an easy way for administrators to create secure separate exports without having to manage entirely separate volumes.
>
> If subvolumes got real mountpoints and so on, this would be easy.

That's the idea, we'll see how well it works out ;).

Thanks,

Josef
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:> Oh well crud, I was hoping that I could leave the inode numbers as 256 for > everything, but I forgot about readdir. So the inode item in the parent would > have to have a unique inode number that would get spit out in readdir, but then > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > incompat flag it is then.I think you''re already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from readdir, the second from stat.

? --b.

#include <stdio.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>

/* demonstrate that for mountpoints, readdir returns the ino of the
 * mounted-on directory, while stat returns the ino of the mounted
 * directory. */
int main(int argc, char *argv[])
{
	struct dirent *de;
	int ret;
	DIR *d;

	if (argc != 2)
		errx(1, "usage: %s <directory>", argv[0]);
	ret = chdir(argv[1]);
	if (ret)
		err(1, "chdir %s", argv[1]);
	d = opendir(".");
	if (!d)
		err(1, "opendir .");
	while ((de = readdir(d)) != NULL) {
		struct stat st;

		ret = stat(de->d_name, &st);
		if (ret)
			err(1, "stat %s", de->d_name);
		printf("%s %llu %llu\n", de->d_name,
		       (unsigned long long)de->d_ino,
		       (unsigned long long)st.st_ino);
	}
	closedir(d);
	return 0;
}

-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 1 Dec 2010 09:21:36 -0500 Josef Bacik <josef@redhat.com> wrote:> There is one tricky thing. When you create a subvolume, the directory inode > that is created in the parent subvolume has the inode number of 256. So if you > have a bunch of subvolumes in the same parent subvolume, you are going to have a > bunch of directories with the inode number of 256. This is so when users cd > into a subvolume we can know its a subvolume and do all the normal voodoo to > start looking in the subvolumes tree instead of the parent subvolumes tree. > > This is where things go a bit sideways. We had serious problems with NFS, but > thankfully NFS gives us a bunch of hooks to get around these problems. > CIFS/Samba do not, so we will have problems there, not to mention any other > userspace application that looks at inode numbers.A more common use case than CIFS or samba is going to be things like backup programs. They commonly look at inode numbers in order to identify hardlinks and may be horribly confused when there files that have a link count >1 and inode number collisions with other files. That probably qualifies as an "enterprise-ready" show stopper...> === What do we do? ==> > This is where I expect to see the most discussion. Here is what I want to do > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. >Sounds like you''re on the right track. The key concept is really that an inode number should be unique within the scope of the st_dev. The simplest solution for you here is simply to give each subvol its own st_dev and mount it up via a shrinkable mount automagically when someone walks into the directory. In addition to the examples of this in NFS, CIFS does this for DFS referrals. Today, this is mostly done by hijacking the follow_link operation, but David Howells proposed some patches a while back to do this via a more formalized interface. It may be reasonable to target this work on top of that, depending on the state of those changes... -- Jeff Layton <jlayton@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
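To make the point above concrete, here is a minimal sketch of the kind of hardlink detection a backup tool typically does: every copied file is keyed on the (st_dev, st_ino) pair, and a later file with the same key and st_nlink > 1 is treated as another name for data the tool already has. The helper name and the fixed-size table are purely illustrative, not taken from any real backup program.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Illustrative sketch of backup-style hardlink detection.  A file is keyed
 * on (st_dev, st_ino); seeing the same key again with st_nlink > 1 is taken
 * to mean "another name for a file we already copied".  This only works if
 * inode numbers are unique within one st_dev. */
struct seen { dev_t dev; ino_t ino; };
static struct seen table[1024];
static int nseen;

static int seen_before(const char *path)
{
	struct stat st;
	int i;

	if (lstat(path, &st) < 0 || st.st_nlink <= 1)
		return 0;
	for (i = 0; i < nseen; i++)
		if (table[i].dev == st.st_dev && table[i].ino == st.st_ino)
			return 1;
	if (nseen < 1024) {
		table[nseen].dev = st.st_dev;
		table[nseen].ino = st.st_ino;
		nseen++;
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++)
		printf("%s: %s\n", argv[i],
		       seen_before(argv[i]) ? "hardlink of an earlier file" : "new file");
	return 0;
}

If two distinct files in two subvolumes both report inode 256 while sharing one st_dev, the second file is wrongly recorded as a hardlink of the first, which is exactly the confusion described above.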
On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:> On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote: > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for > > everything, but I forgot about readdir. So the inode item in the parent would > > have to have a unique inode number that would get spit out in readdir, but then > > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > > incompat flag it is then. > > I think you''re already fine: > > # mkdir TMP > # dd if=/dev/zero of=TMP-image bs=1M count=512 > # mkfs.btrfs TMP-image > # mount -oloop TMP-image TMP/ > # btrfs subvolume create sub-a > # btrfs subvolume create sub-b > ../readdir-inos . > . 256 256 > .. 256 4130609 > sub-a 256 256 > sub-b 257 256 > > Where readdir-inos is my silly test program below, and the first number is from > readdir, the second from stat. >Heh as soon as I typed my email I went and actually looked at the code, looks like for readdir we fill in the root id, which will be unique, so hotdamn we are good and I don''t have to use a stupid incompat flag. Thanks for checking that :), Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: > > On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote: > > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for > > > everything, but I forgot about readdir. So the inode item in the parent would > > > have to have a unique inode number that would get spit out in readdir, but then > > > if we stat''ed the directory we''d get 256 for the inode number. Oh well, > > > incompat flag it is then. > > > > I think you''re already fine: > > > > # mkdir TMP > > # dd if=/dev/zero of=TMP-image bs=1M count=512 > > # mkfs.btrfs TMP-image > > # mount -oloop TMP-image TMP/ > > # btrfs subvolume create sub-a > > # btrfs subvolume create sub-b > > ../readdir-inos . > > . 256 256 > > .. 256 4130609 > > sub-a 256 256 > > sub-b 257 256 > > > > Where readdir-inos is my silly test program below, and the first number is from > > readdir, the second from stat. > > > > Heh as soon as I typed my email I went and actually looked at the code, looks > like for readdir we fill in the root id, which will be unique, so hotdamn we are > good and I don''t have to use a stupid incompat flag. Thanks for checking that > :),My only complaint was just about how you said this: "When you create a subvolume, the directory inode that is created in the parent subvolume has the inode number of 256" If you revise that you might want to clarify. (Maybe "Every subvolume has a root directory inode with inode number 256"?) The way you''ve stated it sounds like you''re talking about the readdir-returned number, which would normally come from the inode that has been covered up by the mount, and which really is an inode in the parent filesystem.... --b. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:> On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote: >> If you delete your subvolume A, like use the btrfs tool to delete it, you will >> only be stuck with what you changed in snapshot B. So if you only changed 5gig >> worth of information, and you deleted the original subvolume, you would have >> 5gig charged to your quota. > > This doesn''t work, though, if the owners of the "original" and > "new" subvolume are different: > > Case 1: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > * Porthos deletes his copy of the data. > > Case 2: > > * Porthos creates 10G of data. > * Athos makes a snapshot of Porthos''s data. > * Porthos deletes his copy of the data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > > Case 3: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * Aramis makes a snapshot of Porthos''s data. > * A sysadmin (Richelieu) changes the ownership on Athos''s snapshot of > Porthos''s data to Athos. > * Porthos deletes his copy of the data. > > Case 4: > > * Porthos creates 10G data. > * Athos makes a snapshot of Porthos''s data. > * Aramis makes a snapshot of Athos''s data. > * Porthos deletes his copy of the data. > [Consider also Richelieu changing ownerships of Athos''s and Aramis''s > data at alternative points in this sequence] > > In each of these, who gets charged (and how much) for their copy of > the data? > >> The idea is you are only charged for what blocks >> you have on the disk. Thanks, > > My point was that it''s perfectly possible to have blocks on the > disk that are effectively owned by two people, and that the person to > charge for those blocks is, to me, far from clear. You either end up > charging twice for a single set of blocks on the disk, or you end up > in a situation where one person''s actions can cause another person''s > quota to fill up. Neither of these is particularly obvious behaviour.As a sysadmin and as a user, quotas shouldn''t be about "physical blocks of storage used" but should be about "logical storage used". IOW, if the filesystem is compressed, using 1 GB of physical space to store 10 GB of data, my "quota used" should be 10 GB. Similar for deduplication. The quota is based on the storage *before* the file is deduped. Not after. Similar for snapshots. If UserA has 10 GB of quota used, I snapshot their filesystem, then my "quota used" would be 10 GB as well. As data in my snapshot changes, my "quota used" is updated to reflect that (change 1 GB of data compared to snapshot, use 1 GB of quota). You have to (or at least should) keep two sets of stats for storage usage: - logical amount used ("real" file size, before compression, before de-dupe, before snapshots, etc) - physical amount used (what''s actually written to disk) User-level quotas are based on the logical storage used. Admin-level quotas (if you want to implement them) would be based on physical storage used. Thus, the output of things like df, du, ls would show the "logical" storage used and file sizes. And you would either have an additional option to those apps (--real or something) to show the "actual" storage used and file sizes as stored on disk. Trying to make quotas and disk usage utilities to work based on what''s physically on disk is just backwards, imo. 
And prone to a lot of confusion. -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, 01 December, 2010, Jeff Layton wrote:> A more common use case than CIFS or samba is going to be things like > backup programs. They commonly look at inode numbers in order to > identify hardlinks and may be horribly confused when there files that > have a link count >1 and inode number collisions with other files. > > That probably qualifies as an "enterprise-ready" show stopper...I hope that a backup program, uses the pair (inode,fsid) to identify if two file are hardlinked... otherwise a backup of two filesystem mounted can be quite danguerous... From the statfs(2) man page: [..] The f_fsid field [...] The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the file-system type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the file system when NFS-exported, and giving it out is a security concern. And the btrfs_statfs function returns a different fsid for every subvolume. -- gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it> Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 1 Dec 2010 21:46:03 +0100 Goffredo Baroncelli <kreijack@libero.it> wrote:> On Wednesday, 01 December, 2010, Jeff Layton wrote: > > A more common use case than CIFS or samba is going to be things like > > backup programs. They commonly look at inode numbers in order to > > identify hardlinks and may be horribly confused when there files that > > have a link count >1 and inode number collisions with other files. > > > > That probably qualifies as an "enterprise-ready" show stopper... > > I hope that a backup program, uses the pair (inode,fsid) to identify if two > file are hardlinked... otherwise a backup of two filesystem mounted can be > quite danguerous... > > > From the statfs(2) man page: > [..] > The f_fsid field > [...] > The general idea is that f_fsid contains some random stuff such that the pair > (f_fsid,ino) uniquely determines a file. Some operating systems use (a > variation on) the device number, or the device number combined with the > file-system type. Several OSes restrict giving out the f_fsid field to the > superuser only (and zero it for unprivileged users), because this field is > used in the filehandle of the file system when NFS-exported, and giving it out > is a security concern. > > > And the btrfs_statfs function returns a different fsid for every subvolume. >Ahh, interesting. I''ve never read that blurb on f_fsid... Unfortunately, it looks like not all filesystems fill that field out. NFS and CIFS leave it conspicuously blank. Those are probably bugs... OTOH, the GLibc docs say this: dev_t st_dev Identifies the device containing the file. The st_ino and st_dev, taken together, uniquely identify the file. The st_dev value is not necessarily consistent across reboots or system crashes, however. ...and it''s always been my understanding that a st_dev/st_ino combination should be unique. Is there some definitive POSIX statement on why one should prefer to use f_fsid over st_dev in this situation? -- Jeff Layton <jlayton@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
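For reference, both identifiers being discussed here can be read from userspace as in the sketch below. This is Linux-specific (statfs(2) from sys/vfs.h) and __val is the glibc name for the two opaque words inside fsid_t; as noted above, whether f_fsid is actually populated depends on the filesystem, so this illustrates the two candidate keys rather than recommending either.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
	struct stat st;
	struct statfs stf;
	int i;

	for (i = 1; i < argc; i++) {
		if (stat(argv[i], &st) < 0 || statfs(argv[i], &stf) < 0) {
			perror(argv[i]);
			continue;
		}
		/* (st_dev, st_ino) is the traditional key; (f_fsid, st_ino)
		 * is the pair statfs(2) describes.  Either relies on the
		 * first element being unique per (sub)volume. */
		printf("%s: st_dev=%llx st_ino=%llu f_fsid=%x:%x\n", argv[i],
		       (unsigned long long)st.st_dev,
		       (unsigned long long)st.st_ino,
		       (unsigned)stf.f_fsid.__val[0],
		       (unsigned)stf.f_fsid.__val[1]);
	}
	return 0;
}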
On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: > >> The idea is you are only charged for what blocks > >> you have on the disk. Thanks, > > > > My point was that it''s perfectly possible to have blocks on the > > disk that are effectively owned by two people, and that the person to > > charge for those blocks is, to me, far from clear. You either end up > > charging twice for a single set of blocks on the disk, or you end up > > in a situation where one person''s actions can cause another person''s > > quota to fill up. Neither of these is particularly obvious behaviour. > > As a sysadmin and as a user, quotas shouldn''t be about "physical > blocks of storage used" but should be about "logical storage used". > > IOW, if the filesystem is compressed, using 1 GB of physical space to > store 10 GB of data, my "quota used" should be 10 GB. > > Similar for deduplication. The quota is based on the storage *before* > the file is deduped. Not after. > > Similar for snapshots. If UserA has 10 GB of quota used, I snapshot > their filesystem, then my "quota used" would be 10 GB as well. As > data in my snapshot changes, my "quota used" is updated to reflect > that (change 1 GB of data compared to snapshot, use 1 GB of quota).So if I''ve got 10G of data, and I snapshot it, I''ve just used another 10G of quota?> You have to (or at least should) keep two sets of stats for storage usage: > - logical amount used ("real" file size, before compression, before > de-dupe, before snapshots, etc) > - physical amount used (what''s actually written to disk) > > User-level quotas are based on the logical storage used. > Admin-level quotas (if you want to implement them) would be based on > physical storage used. > > Thus, the output of things like df, du, ls would show the "logical" > storage used and file sizes. And you would either have an additional > option to those apps (--real or something) to show the "actual" > storage used and file sizes as stored on disk. > > Trying to make quotas and disk usage utilities to work based on what''s > physically on disk is just backwards, imo. And prone to a lot of > confusion.Trying to make quotas work based on what''s physically on the disk appears to have serious issues on the semantics of "using up space", so I agree with you on this point (and, indeed, it was the point I was trying to make). However, doing it that way also effectively penalises users and prevents (or severely discourages) them from using the advanced functions of the filesystem. There''s no benefit (in disk usage terms) to the user in using a snapshot -- they might as well use plain cp. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk == PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- I believe that it''s closely correlated with --- the aeroswine coefficient.
On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote: >> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >> >> The idea is you are only charged for what blocks >> >> you have on the disk. Thanks, >> > >> > My point was that it''s perfectly possible to have blocks on the >> > disk that are effectively owned by two people, and that the person to >> > charge for those blocks is, to me, far from clear. You either end up >> > charging twice for a single set of blocks on the disk, or you end up >> > in a situation where one person''s actions can cause another person''s >> > quota to fill up. Neither of these is particularly obvious behaviour. >> >> As a sysadmin and as a user, quotas shouldn''t be about "physical >> blocks of storage used" but should be about "logical storage used". >> >> IOW, if the filesystem is compressed, using 1 GB of physical space to >> store 10 GB of data, my "quota used" should be 10 GB. >> >> Similar for deduplication. The quota is based on the storage *before* >> the file is deduped. Not after. >> >> Similar for snapshots. If UserA has 10 GB of quota used, I snapshot >> their filesystem, then my "quota used" would be 10 GB as well. As >> data in my snapshot changes, my "quota used" is updated to reflect >> that (change 1 GB of data compared to snapshot, use 1 GB of quota). > > So if I''ve got 10G of data, and I snapshot it, I''ve just used > another 10G of quota?Sorry, forgot the "per user" bit above. If UserA has 10 GB of data, then UserB snapshots it, UserB''s quota usage is 10 GB. If UserA has 10 GB of data and snapshots it, then only 10 GB of quota usage is used, as there is 0 difference between the snapshot and the filesystem. As UserA modifies data, their quota usage increases by the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data == 11 GB quota usage). If you combine the two scenarios, you end up with: - UserA has 10 GB of data == 10 GB quota usage - UserB snapshots UserA''s filesystem (clone), so UserB has 10 GB quota usage (even though 0 blocks have changed on disk) - UserA snapshots UserA''s filesystem == no change to quota usage (no blocks on disk have changed) - UserA modifies 1 GB of data in the filesystem == 1 GB new quota usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus the 10 GB in the snapshot) - UserB still only has 10 GB quota usage, since their snapshot hasn''t changed (0 blocks changed) If UserA deletes their filesystem and all their snapshots, freeing up 11 GB of quota usage on their account, UserB''s quota will still be 10 GB, and the blocks on the disk aren''t actually removed (still referenced by UserB''s snapshot). Basically, within a user''s account, only the data unique to a snapshot should count toward the quota. Across accounts, the original (root) snapshot would count completely to the new user''s quota, and then only data unique to subsequent snapshots would count. I hope that makes it more clear. :) All the different layers and whatnot get confusing. :) -- Freddie Cash fjwcash@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: >> I think you''re already fine: >> >> # mkdir TMP >> # dd if=/dev/zero of=TMP-image bs=1M count=512 >> # mkfs.btrfs TMP-image >> # mount -oloop TMP-image TMP/ >> # btrfs subvolume create sub-a >> # btrfs subvolume create sub-b >> ../readdir-inos . >> . 256 256 >> .. 256 4130609 >> sub-a 256 256 >> sub-b 257 256 >> >> Where readdir-inos is my silly test program below, and the first >> number is from readdir, the second from stat. >> > > Heh as soon as I typed my email I went and actually looked at the > code, looks like for readdir we fill in the root id, which will be > unique, so hotdamn we are good and I don''t have to use a stupid > incompat flag. Thanks for checking that :),Except, aren''t the inode numbers within a filesystem and the sunbvolume tree IDs allocated out of separate namespaces? I don''t think there''s anything preventing a file/directory from having an inode number that clashes with one of the snapshots. In fact, this already happens in the example above: "." (inode 256 in the root subvolume) and "sub-a" (subvolume ID 256). (Though I still don''t understand the semantics well enough to say whether we need all the inode numbers returned by readdir to be distinct.) --Michael Vrable -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 1, 2010 at 3:32 PM, Freddie Cash <fjwcash@gmail.com> wrote:> On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote: >>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote: >>> >> The idea is you are only charged for what blocks >>> >> you have on the disk. Thanks, >>> > >>> > My point was that it''s perfectly possible to have blocks on the >>> > disk that are effectively owned by two people, and that the person to >>> > charge for those blocks is, to me, far from clear. You either end up >>> > charging twice for a single set of blocks on the disk, or you end up >>> > in a situation where one person''s actions can cause another person''s >>> > quota to fill up. Neither of these is particularly obvious behaviour. >>> >>> As a sysadmin and as a user, quotas shouldn''t be about "physical >>> blocks of storage used" but should be about "logical storage used". >>> >>> IOW, if the filesystem is compressed, using 1 GB of physical space to >>> store 10 GB of data, my "quota used" should be 10 GB. >>> >>> Similar for deduplication. The quota is based on the storage *before* >>> the file is deduped. Not after. >>> >>> Similar for snapshots. If UserA has 10 GB of quota used, I snapshot >>> their filesystem, then my "quota used" would be 10 GB as well. As >>> data in my snapshot changes, my "quota used" is updated to reflect >>> that (change 1 GB of data compared to snapshot, use 1 GB of quota). >> >> So if I''ve got 10G of data, and I snapshot it, I''ve just used >> another 10G of quota? > > Sorry, forgot the "per user" bit above. > > If UserA has 10 GB of data, then UserB snapshots it, UserB''s quota > usage is 10 GB. > > If UserA has 10 GB of data and snapshots it, then only 10 GB of quota > usage is used, as there is 0 difference between the snapshot and the > filesystem. As UserA modifies data, their quota usage increases by > the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data > == 11 GB quota usage). > > If you combine the two scenarios, you end up with: > - UserA has 10 GB of data == 10 GB quota usage > - UserB snapshots UserA''s filesystem (clone), so UserB has 10 GB > quota usage (even though 0 blocks have changed on disk)Please define where the owner of a subvolume/snapshot is stored. To my knowledge when you make a snapshot, you have the same set of files with the same set of owners and groups. Whatever user does the snapshot this does not change this unless chown or chgrp are used. Also a non-root user (or a process without CAP_whatever) should not be able to snapshot a subvolume where the root directory of that subvolume is not owned by the user attempting the snapshot. If you do not do so then you end up with the same security and quota issues that hard links have when you don''t have separate filesystems. You could have separate subvolumes for / and /home/foo and user foo could snapshot / to /home/foo/exploit_later_001 and then foo can just wait for an exploit to come along for one of the binaries or libs in /home/foo/exploit_later_001 and own. Yes, snapshot creation should be more restricted than hard links, for good reason. 
I have other questions but the answer to this fundamental game changer may solve many of the mentioned issues.> - UserA snapshots UserA''s filesystem == no change to quota usage (no > blocks on disk have changed) > - UserA modifies 1 GB of data in the filesystem == 1 GB new quota > usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus > the 10 GB in the snapshot) > - UserB still only has 10 GB quota usage, since their snapshot > hasn''t changed (0 blocks changed) > > If UserA deletes their filesystem and all their snapshots, freeing up > 11 GB of quota usage on their account, UserB''s quota will still be 10 > GB, and the blocks on the disk aren''t actually removed (still > referenced by UserB''s snapshot). > > Basically, within a user''s account, only the data unique to a snapshot > should count toward the quota. > > Across accounts, the original (root) snapshot would count completely > to the new user''s quota, and then only data unique to subsequent > snapshots would count. > > I hope that makes it more clear. :) All the different layers and > whatnot get confusing. :)-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> > This is a huge topic in and of itself, but Christoph mentioned wanting to have > an idea of what we wanted to do with it, so I''m putting it here. There are > really 2 things here > > 1) Limiting the size of subvolumes. This is really easy for us, just create a > subvolume and at creation time set a maximum size it can grow to and not let it > go farther than that. Nice, simple and straightforward. >I''d love to be able to limit the size of a subvolume. Here the size comprises all blocks this subvolume refers to. But at least as important to me is a mode where one can build groups of sub- volumes and snapshots and define a quota for the complete group. Again, the size here comprises all blocks any of the subvolumes/snapshots refer to. If a block is referred to more than once, it counts only once. A subvolume/snapshot can be configured to be part of multiple groups. With this I can do interesting things: a) The user pays only for the space he occupies, not for read-only snapshots b) The user pays for his space and for all the snapshots c) The user pays for his space and snapshots, but not for snapshots generated for internal backup purposes d) Hierarchical quotas. I can limit /home and set an additional quota on each homedir Thanks, Arne -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik wrote:> > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. >What about the alternative and allocating inode numbers globally? The only problem would be with snapshots as they share the inum with the source, but one could just remap inode numbers in snapshots by sparing some bits at the top of this 64 bit field. Having one mount per subvolume/snapshots is the cleaner solution, but quickly leads to situations where you have _lots_ of mounts, especially when you export them via NFS and mount it somewhere else. I''ve seen a machine which had to handle > 100,000 mounts from a zfs server. This definitely brings it''s own problems, so I''d love to see a full fs exported as a single mount. This will also keep output from tools like iostat (for nfs mounts) and df readable. Thanks, Arne -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Arne Jansen''s message of 2010-12-02 04:49:39 -0500:> Josef Bacik wrote: > > > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > > that way. This unfortunately will be an incompatible format change, but the > > sooner we get this adressed the easier it will be in the long run. Obviously > > when I say format change I mean via the incompat bits we have, so old fs''s won''t > > be broken and such. > > > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > > just do dentry trickery, but that doesn''t make the boundary between subvolumes > > clear, so it will confuse people (and samba) when they walk into a subvolume and > > all of a sudden the inode numbers are the same as in the directory behind them. > > With doing the referral mount thing, each subvolume appears to be its own mount > > and that way things like NFS and samba will work properly. > > > > What about the alternative and allocating inode numbers globally? The only > problem would be with snapshots as they share the inum with the source, but > one could just remap inode numbers in snapshots by sparing some bits at the > top of this 64 bit field.The global inode number is possible, it''s just another btree that must be maintained on disk in order to map which inodes are free and which ones aren''t. It also needs to have a reference count on each inode, since each snapshot effectively increases the reference count on every file and directory it contains. The cost of maintaining that reference count is very very high. -chris> > Having one mount per subvolume/snapshots is the cleaner solution, but > quickly leads to situations where you have _lots_ of mounts, especially when > you export them via NFS and mount it somewhere else. I''ve seen a machine > which had to handle > 100,000 mounts from a zfs server. This definitely > brings it''s own problems, so I''d love to see a full fs exported as a single > mount. This will also keep output from tools like iostat (for nfs mounts) > and df readable. > > Thanks, > Arne-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/12/10 16:11, Chris Mason wrote:> Excerpts from Arne Jansen''s message of 2010-12-02 04:49:39 -0500: > >> Josef Bacik wrote: >> >>> 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode >>> to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic >>> that way. This unfortunately will be an incompatible format change, but the >>> sooner we get this adressed the easier it will be in the long run. Obviously >>> when I say format change I mean via the incompat bits we have, so old fs''s won''t >>> be broken and such. >>> >>> 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we >>> just do dentry trickery, but that doesn''t make the boundary between subvolumes >>> clear, so it will confuse people (and samba) when they walk into a subvolume and >>> all of a sudden the inode numbers are the same as in the directory behind them. >>> With doing the referral mount thing, each subvolume appears to be its own mount >>> and that way things like NFS and samba will work properly. >>> >>> >> What about the alternative and allocating inode numbers globally? The only >> problem would be with snapshots as they share the inum with the source, but >> one could just remap inode numbers in snapshots by sparing some bits at the >> top of this 64 bit field. >> > The global inode number is possible, it''s just another btree that must > be maintained on disk in order to map which inodes are free and which > ones aren''t. It also needs to have a reference count on each inode, > since each snapshot effectively increases the reference count on > every file and directory it contains. > > The cost of maintaining that reference count is very very high. >A couple of years ago I was suffering from the problem of different files having the same inode number on Netapp servers. On a Netapp device if you snapshot a volume then the files in the snapshot have the same inode number as the original, even if the original changes. (Netapp snapshots are read only). This means that if you attempt to see what has changed since your last snapshot using a command line such as: diff src/file.c .snapshots/hourly.12/src.file.c Then the diff tool will tell you that the files are the same even if they are different, because it is assuming that files with the same inode number will have identical contents. Therefore I think it is a bad idea if potentially different files on btrfs can have the same inode number. It will break all sorts of tools. Instead of maintaining a big complicated reference count of used inode numbers, could btrfs use bit masks to create a the userland visible inode number from the subvolume id and the real internal inode number. Something like: userland_inode = ( volume_id << 48 ) & internal_inode; Please forgive me if this is impossible, or if that C snippet is syntactically incorrect. I am not a filesystem or kernel developer, and I have not coded in C for many years. -- David Pottage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
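A small correction to the snippet above: combining the two fields wants a bitwise OR (or addition) rather than AND, since ANDing them would mostly zero the result. A minimal sketch of the packing idea, with the field split (16 bits of subvolume id above 48 bits of per-tree inode number) chosen purely for illustration and not reflecting any actual btrfs limit:

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: expose a per-tree inode number with the subvolume id
 * packed into the top bits, so the user-visible number is unique across
 * the whole filesystem.  The 16/48 split is an assumption for the example. */
static uint64_t userland_inode(uint64_t volume_id, uint64_t internal_inode)
{
	return (volume_id << 48) | (internal_inode & ((1ULL << 48) - 1));
}

int main(void)
{
	/* Two subvolumes can both have an internal inode 256 and still look
	 * different to userspace. */
	printf("%llu\n", (unsigned long long)userland_inode(5, 256));
	printf("%llu\n", (unsigned long long)userland_inode(6, 256));
	return 0;
}

The obvious cost, as the follow-up below also notes, is that the split caps both the number of subvolumes and the per-subvolume inode space.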
On 12/02/2010 04:49 AM, Arne Jansen wrote:> What about the alternative and allocating inode numbers globally? The only > problem would be with snapshots as they share the inum with the source, but > one could just remap inode numbers in snapshots by sparing some bits at the > top of this 64 bit field.I was wondering this as well. Why give each subvol its own inode number space? To avoid breaking assumptions of various programs, if they each have their own inode space, they must each have a unique st_dev. How are inode numbers currently allocated, and why wouldn''t it be simple to just have a single pool of inode numbers for all subvols? It seems obvious to me that snapshots start out inheriting the inode numbers of the original subvol, but must be given a new st_dev. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Josef, > 1) Scrap the 256 inode number thing. Instead we''ll just put a > flag in the inode to say "Hey, I''m a subvolume" and then we can > do all of the appropriate magic that way. This unfortunately > will be an incompatible format change, but the sooner we get this > adressed the easier it will be in the long run. Obviously when I > say format change I mean via the incompat bits we have, so old > fs''s won''t be broken and such. Sorry if I''ve missed this elsewhere in the thread -- will we still have an efficient operation for enumerating subvolumes and snapshots, and how will that work? We''re going to want tools like plymouth and grub to be able to list all snapshots without running a large scan. Thanks, - Chris. -- Chris Ball <cjb@laptop.org> One Laptop Per Child -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2010/12/2 David Pottage <david@electric-spoon.com>:> > Therefore I think it is a bad idea if potentially different files on btrfs > can have the same inode number. It will break all sorts of tools. > > Instead of maintaining a big complicated reference count of used inode > numbers, could btrfs use bit masks to create a the userland visible inode > number from the subvolume id and the real internal inode number. Something > like: > > userland_inode = ( volume_id << 48 ) & internal_inode; > > Please forgive me if this is impossible, or if that C snippet is > syntactically incorrect. I am not a filesystem or kernel developer, and I > have not coded in C for many years. > > -- > David Pottage >Expanding on the idea: what about a pool of IDs for subvolumes and inode numbers inside a subvolume having the subvolume ID as a prefix? It gives each inode a unique number, doesn''t require cheating the userland and is less costly than keeping reference count for each inode. The obvious downside that I can see is limitation on number of subvolumes that it would be possible to create. It also lowers the maximum number of inodes in a filesystem (because of bits taken up by subvolume ID). I expect there are also less-than obvious downsides. Just an idea by a kernel and FS ignorant. -- Paweł Brodacki -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote:> Hi Josef, > > > 1) Scrap the 256 inode number thing. Instead we''ll just put a > > flag in the inode to say "Hey, I''m a subvolume" and then we can > > do all of the appropriate magic that way. This unfortunately > > will be an incompatible format change, but the sooner we get this > > adressed the easier it will be in the long run. Obviously when I > > say format change I mean via the incompat bits we have, so old > > fs''s won''t be broken and such. > > Sorry if I''ve missed this elsewhere in the thread -- will we still > have an efficient operation for enumerating subvolumes and snapshots, > and how will that work? We''re going to want tools like plymouth and > grub to be able to list all snapshots without running a large scan. >Yeah the idea is we want to fix the problems with the design without breaking anything that currently works. So all the changes I want to make are going to be invisible for the user. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 05:52:07PM -0800, Michael Vrable wrote:> On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote: > >On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote: > >>I think you''re already fine: > >> > >> # mkdir TMP > >> # dd if=/dev/zero of=TMP-image bs=1M count=512 > >> # mkfs.btrfs TMP-image > >> # mount -oloop TMP-image TMP/ > >> # btrfs subvolume create sub-a > >> # btrfs subvolume create sub-b > >> ../readdir-inos . > >> . 256 256 > >> .. 256 4130609 > >> sub-a 256 256 > >> sub-b 257 256 > >> > >>Where readdir-inos is my silly test program below, and the first > >>number is from readdir, the second from stat. > >> > > > >Heh as soon as I typed my email I went and actually looked at the > >code, looks like for readdir we fill in the root id, which will be > >unique, so hotdamn we are good and I don''t have to use a stupid > >incompat flag. Thanks for checking that :), > > Except, aren''t the inode numbers within a filesystem and the > sunbvolume tree IDs allocated out of separate namespaces? I don''t > think there''s anything preventing a file/directory from having an > inode number that clashes with one of the snapshots. > > In fact, this already happens in the example above: "." (inode 256 > in the root subvolume) and "sub-a" (subvolume ID 256).Oof, yes, I overlooked that.> (Though I still don''t understand the semantics well enough to say > whether we need all the inode numbers returned by readdir to be > distinct.)On normal mounts they''re the number of the inode that was mounted over, so normally they''d be unique across the parent filesystem..... I don''t know if anything depends on that. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 02, 2010 at 05:14:53PM +0000, David Pottage wrote:> A couple of years ago I was suffering from the problem of different > files having the same inode number on Netapp servers. On a Netapp > device if you snapshot a volume then the files in the snapshot have > the same inode number as the original, even if the original changes. > (Netapp snapshots are read only). > > This means that if you attempt to see what has changed since your > last snapshot using a command line such as: > > diff src/file.c .snapshots/hourly.12/src.file.c > > Then the diff tool will tell you that the files are the same even if > they are different, because it is assuming that files with the same > inode number will have identical contents.diff should also recognize when they''re on different filesystem, so this should also be fixable if subvolumes are treated as different filesystem (in the sense that they have different vfsmounts and fsid''s). --b.> > Therefore I think it is a bad idea if potentially different files on > btrfs can have the same inode number. It will break all sorts of > tools. > > Instead of maintaining a big complicated reference count of used > inode numbers, could btrfs use bit masks to create a the userland > visible inode number from the subvolume id and the real internal > inode number. Something like: > > userland_inode = ( volume_id << 48 ) & internal_inode; > > Please forgive me if this is impossible, or if that C snippet is > syntactically incorrect. I am not a filesystem or kernel developer, > and I have not coded in C for many years. > > -- > David Pottage > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:> Hello, > > Various people have complained about how BTRFS deals with subvolumes recently, > specifically the fact that they all have the same inode number, and there''s no > discrete seperation from one subvolume to another. Christoph asked that I lay > out a basic design document of how we want subvolumes to work so we can hash > everything out now, fix what is broken, and then move forward with a design that > everybody is more or less happy with. I apologize in advance for how freaking > long this email is going to be. I assume that most people are generally > familiar with how BTRFS works, so I''m not going to bother explaining in great > detail some stuff. > > === What are subvolumes? ==> > They are just another tree. In BTRFS we have various b-trees to describe the > filesystem. A few of them are filesystem wide, such as the extent tree, chunk > tree, root tree etc. The tree''s that hold the actual filesystem data, that is > inodes and such, are kept in their own b-tree. This is how subvolumes and > snapshots appear on disk, they are simply new b-trees with all of the file data > contained within them. > > === What do subvolumes look like? ==> > All the user sees are directories. They act like any other directory acts, with > a few exceptions > > 1) You cannot hardlink between subvolumes. This is because subvolumes have > their own inode numbers and such, think of them as seperate mounts in this case, > you cannot hardlink between two mounts because the link needs to point to the > same on disk inode, which is impossible between two different filesystems. The > same is true for subvolumes, they have their own trees with their own inodes and > inode numbers, so it''s impossible to hardlink between them. > > 1a) In case it wasn''t clear from above, each subvolume has their own inode > numbers, so you can have the same inode numbers used between two different > subvolumes, since they are two different trees. > > 2) Obviously you can''t just rm -rf subvolumes. Because they are roots there''s > extra metadata to keep track of them, so you have to use one of our ioctls to > delete subvolumes/snapshots. > > But permissions and everything else they are the same. > > There is one tricky thing. When you create a subvolume, the directory inode > that is created in the parent subvolume has the inode number of 256. So if you > have a bunch of subvolumes in the same parent subvolume, you are going to have a > bunch of directories with the inode number of 256. This is so when users cd > into a subvolume we can know its a subvolume and do all the normal voodoo to > start looking in the subvolumes tree instead of the parent subvolumes tree. > > This is where things go a bit sideways. We had serious problems with NFS, but > thankfully NFS gives us a bunch of hooks to get around these problems. > CIFS/Samba do not, so we will have problems there, not to mention any other > userspace application that looks at inode numbers. > > === How do we want subvolumes to work from a user perspective? ==> > 1) Users need to be able to create their own subvolumes. The permission > semantics will be absolutely the same as creating directories, so I don''t think > this is too tricky. We want this because you can only take snapshots of > subvolumes, and so it is important that users be able to create their own > discrete snapshottable targets. > > 2) Users need to be able to snapshot their subvolumes. 
This is basically the > same as #1, but it bears repeating. > > 3) Subvolumes shouldn''t need to be specifically mounted. This is also > important, we don''t want users to have to go around mounting their subvolumes up > manually one-by-one. Today users just cd into subvolumes and it works, just > like cd''ing into a directory. > > === Quotas ==> > This is a huge topic in and of itself, but Christoph mentioned wanting to have > an idea of what we wanted to do with it, so I''m putting it here. There are > really 2 things here > > 1) Limiting the size of subvolumes. This is really easy for us, just create a > subvolume and at creation time set a maximum size it can grow to and not let it > go farther than that. Nice, simple and straightforward. > > 2) Normal quotas, via the quota tools. This just comes down to how do we want > to charge users, do we want to do it per subvolume, or per filesystem. My vote > is per filesystem. Obviously this will make it tricky with snapshots, but I > think if we''re just charging the diff''s between the original volume and the > snapshot to the user then that will be the easiest for people to understand, > rather than making a snapshot all of a sudden count the users currently used > quota * 2. > > === What do we do? ==> > This is where I expect to see the most discussion. Here is what I want to do > > 1) Scrap the 256 inode number thing. Instead we''ll just put a flag in the inode > to say "Hey, I''m a subvolume" and then we can do all of the appropriate magic > that way. This unfortunately will be an incompatible format change, but the > sooner we get this adressed the easier it will be in the long run. Obviously > when I say format change I mean via the incompat bits we have, so old fs''s won''t > be broken and such. > > 2) Do something like NFS''s referral mounts when we cd into a subvolume. Now we > just do dentry trickery, but that doesn''t make the boundary between subvolumes > clear, so it will confuse people (and samba) when they walk into a subvolume and > all of a sudden the inode numbers are the same as in the directory behind them. > With doing the referral mount thing, each subvolume appears to be its own mount > and that way things like NFS and samba will work properly. > > I feel like I''m forgetting something here, hopefully somebody will point it out. > > === Conclusion ==> > There are definitely some wonky things with subvolumes, but I don''t think they > are things that cannot be fixed now. Some of these changes will require > incompat format changes, but it''s either we fix it now, or later on down the > road when BTRFS starts getting used in production really find out how many > things our current scheme breaks and then have to do the changes then. Thanks, >So now that I''ve actually looked at everything, it looks like the semantics are all right for subvolumes 1) readdir - we return the root id in d_ino, which is unique across the fs 2) stat - we return 256 for all subvolumes, because that is their inode number 3) dev_t - we setup an anon super for all volumes, so they all get their own dev_t, which is set properly for all of their children, see below [root@test1244 btrfs-test]# stat . 
  File: `.''
  Size: 20        Blocks: 8          IO Block: 4096   directory
Device: 15h/21d   Inode: 256         Links: 1
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:41.931679393 -0500
Modify: 2010-12-03 15:35:20.405679493 -0500
Change: 2010-12-03 15:35:20.405679493 -0500

[root@test1244 btrfs-test]# stat foo
  File: `foo''
  Size: 12        Blocks: 0          IO Block: 4096   directory
Device: 19h/25d   Inode: 256         Links: 1
Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:17.501679393 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

[root@test1244 btrfs-test]# stat foo/foobar
  File: `foo/foobar''
  Size: 0         Blocks: 0          IO Block: 4096   regular empty file
Device: 19h/25d   Inode: 257         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-12-03 15:35:59.150680051 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

So as far as the user is concerned, everything should come out right. Obviously we had to do the NFS trickery still because as far as VFS is concerned the subvolumes are all on the same mount. So the question is this (and really this is directed at Christoph and Bruce and anybody else who may care), is this good enough, or do we want to have a seperate vfsmount for each subvolume? Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:> So now that I''ve actually looked at everything, it looks like the semantics are > all right for subvolumes > > 1) readdir - we return the root id in d_ino, which is unique across the fsThough Michael Vrable pointed out an apparent collision with "normal" inode numbers on the parent filesystem?> 2) stat - we return 256 for all subvolumes, because that is their inode number > 3) dev_t - we setup an anon super for all volumes, so they all get their own > dev_t, which is set properly for all of their children, see below > > [root@test1244 btrfs-test]# stat . > File: `.'' > Size: 20 Blocks: 8 IO Block: 4096 directory > Device: 15h/21d Inode: 256 Links: 1 > Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:41.931679393 -0500 > Modify: 2010-12-03 15:35:20.405679493 -0500 > Change: 2010-12-03 15:35:20.405679493 -0500 > > [root@test1244 btrfs-test]# stat foo > File: `foo'' > Size: 12 Blocks: 0 IO Block: 4096 directory > Device: 19h/25d Inode: 256 Links: 1 > Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:17.501679393 -0500 > Modify: 2010-12-03 15:35:59.150680051 -0500 > Change: 2010-12-03 15:35:59.150680051 -0500 > > [root@test1244 btrfs-test]# stat foo/foobar > File: `foo/foobar'' > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: 19h/25d Inode: 257 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-12-03 15:35:59.150680051 -0500 > Modify: 2010-12-03 15:35:59.150680051 -0500 > Change: 2010-12-03 15:35:59.150680051 -0500 > > So as far as the user is concerned, everything should come out right. Obviously > we had to do the NFS trickery still because as far as VFS is concerned the > subvolumes are all on the same mount. So the question is this (and really this > is directed at Christoph and Bruce and anybody else who may care), is this good > enough, or do we want to have a seperate vfsmount for each subvolume? Thanks,For nfsd''s purposes, we need to be able find out about filesystems in two different ways: 1. Lookup by filehandle: we need to be able to identify which subvolume we''re dealing with from a filehandle. 2. Lookup by path: we need to notice when we cross into a subvolume. Looks like #1 already works. Not #2: the current nfsd code just checks for mountpoints. We could modify nfsd to also check whether dev_t changed each time it did a lookup. I suppose it would work, though it''s annoying to have to do it just for the case of btrfs. As far as I can tell, crossing into a subvolume is like crossing a mountpoint in every way except for the lack of a separate vfsmount. I''d worry that the inconsistency will end up requiring more special cases down the road, but I don''t have any in mind. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
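The userspace equivalent of the lookup-time check described above is simply comparing st_dev on either side of a directory boundary. The sketch below is illustrative only, with simplified path handling; nfsd would make the corresponding comparison in the kernel on each lookup rather than via stat(2).

#include <stdio.h>
#include <sys/stat.h>

/* Compare the st_dev of a directory and of an entry inside it; a change
 * means a mountpoint or (with per-subvolume anon dev_t) a subvolume
 * boundary was crossed. */
int main(int argc, char *argv[])
{
	struct stat parent, child;
	char path[4096];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dir> <entry>\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &parent) < 0) {
		perror(argv[1]);
		return 1;
	}
	snprintf(path, sizeof(path), "%s/%s", argv[1], argv[2]);
	if (stat(path, &child) < 0) {
		perror(path);
		return 1;
	}
	printf("%s dev=%llx, %s dev=%llx: %s\n",
	       argv[1], (unsigned long long)parent.st_dev,
	       path, (unsigned long long)child.st_dev,
	       parent.st_dev == child.st_dev ?
	       "same filesystem" : "boundary crossed");
	return 0;
}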
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > [original design write-up snipped]
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we setup an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below

A property of NFS filehandles is that they must be stable across server reboots. Is this anon dev_t used as part of the NFS filehandle, and if so how can you guarantee that it is stable?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> > [earlier discussion snipped]
> >
> > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > 3) dev_t - we setup an anon super for all volumes, so they all get their own
> > dev_t, which is set properly for all of their children, see below
>
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It isn't today, that's something we'll have to address.

-chris
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
> Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> > [earlier discussion snipped]
> >
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend on an explicit check for a filesystem type of "btrfs", which is awful--btrfs won't always be the only filesystem that wants to do this kind of thing, etc.)

--b.
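Roughly, the userspace side of that check looks like the sketch below. It is an illustration only, not the actual nfs-utils code: the real mountd compares the mount's filesystem type string from the mount table rather than the statfs magic, and in the C struct the fsid field is spelled f_fsid.

/* Sketch of the idea only, not nfs-utils code: read a per-(sub)volume fsid
 * via statfs() and only trust it when the filesystem really is btrfs. */
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>

#define BTRFS_SUPER_MAGIC 0x9123683E   /* same value as in <linux/magic.h> */

static int get_btrfs_fsid(const char *path, unsigned long long *fsid)
{
        struct statfs st;

        if (statfs(path, &st))
                return -1;
        if (st.f_type != BTRFS_SUPER_MAGIC)
                return -1;      /* only btrfs promises a meaningful f_fsid here */

        memcpy(fsid, &st.f_fsid, sizeof(*fsid));
        return 0;
}

int main(int argc, char **argv)
{
        unsigned long long fsid;

        if (argc > 1 && get_btrfs_fsid(argv[1], &fsid) == 0)
                printf("fsid: %llx\n", fsid);
        return 0;
}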
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.
>
> (That said, since fs_fsid doesn't work for other filesystems, we depend
> on an explicit check for a filesystem type of "btrfs", which is
> awful--btrfs won't always be the only filesystem that wants to do this
> kind of thing, etc.)

Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

Cheers, Andreas
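To make the proposal concrete, here is a sketch of what such a hook might look like. It is purely hypothetical -- export_operations has no ->get_fsid member -- with semantics as implied by the message above:

/* Hypothetical sketch of the proposed hook; there is no ->get_fsid in
 * export_operations today.  The idea: let the filesystem derive a stable
 * fsid from the superblock plus the directory being exported. */
#include <linux/fs.h>

struct export_operations_proposed {
        /* ...the existing export_operations methods would stay as they are... */

        /* Fill @fsid (up to @max_words 32-bit words) for the export rooted
         * at @inode on @sb; return the number of words used, or -errno. */
        int (*get_fsid)(struct super_block *sb, struct inode *inode,
                        u32 *fsid, int max_words);
};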
On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > [original design write-up snipped]
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we setup an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below
>
> [stat output snipped]
>
> So as far as the user is concerned, everything should come out right. Obviously
> we had to do the NFS trickery still because as far as VFS is concerned the
> subvolumes are all on the same mount. So the question is this (and really this
> is directed at Christoph and Bruce and anybody else who may care), is this good
> enough, or do we want to have a separate vfsmount for each subvolume? Thanks,

What are the drawbacks of having a vfsmount for each subvolume?

Why (besides having to code it up) are you trying to avoid doing it that way?
On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> > [original design write-up and stat output snipped]
> >
> > So as far as the user is concerned, everything should come out right. Obviously
> > we had to do the NFS trickery still because as far as VFS is concerned the
> > subvolumes are all on the same mount. So the question is this (and really this
> > is directed at Christoph and Bruce and anybody else who may care), is this good
> > enough, or do we want to have a separate vfsmount for each subvolume? Thanks,
>
> What are the drawbacks of having a vfsmount for each subvolume?
>
> Why (besides having to code it up) are you trying to avoid doing it that way?

It's the having to code it up that way thing, I'm nothing if not lazy.

Josef
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
> >
> > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > on an explicit check for a filesystem type of "btrfs", which is
> > awful--btrfs won't always be the only filesystem that wants to do this
> > kind of thing, etc.)
>
> Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.
>
> Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

No objection from here.

(Though I don't understand the inode argument--aren't "subvolumes" usually expected to have separate superblocks?)

--b.
> === What do subvolumes look like? ==
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> their own inode numbers and such, think of them as separate mounts in this case,
> [...]

which means they act like a different mount point.

> 1a) In case it wasn't clear from above, each subvolume has their own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.

which means they act like not just a different mount point, but they also act like being a separate superblock.

> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.

Again this means they act like a mount point.

> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.

Not that I'm entirely against this, but instead of just stating it as a must, can you also state the detailed reason? Allowing users to create their own subvolumes is a mostly equivalent problem to allowing user mounts, so handling those two under one umbrella makes a lot of sense.

> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, [...]

From reading later posts in this thread, readdir already seems to take care of this in some way. But is there a chance of collisions between the real inode numbers and the ones faked up for the subvolume roots?

> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>
> I feel like I'm forgetting something here, hopefully somebody will point it out.

The current code requires the automount trigger points to be links, which is something that Chris didn't like at all. But that issue is solved by building upon David Howells' series to replace that follow_link magic with a new d_automount dentry operation. I'd suggest building the new code on top of that.

And most importantly:

3) allocate a different anon dev_t for each subvolume.
One thing that really confuses me is that the actual root of the subvolume appears directly in the parent namespace. Given that you have your subvolume identifiers, that doesn't even seem necessary. To me the following scheme seems more useful:

 - all subvolumes/snapshots only show up in a virtual below-root directory, similar to how the existing "default" one doesn't sit on the top.

 - the entries inside a namespace that are to be automounted have an entry in the filesystem that just marks them as an auto-mount point that redirects to the actual subvolume.

 - we still allow mounting subvolumes (and only those) directly from get_sb by specifying the subvolume name.

This is especially important for snapshots, as just having them hang off the filesystem that is to be snapshotted is extremely confusing.
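For reference, the shape of the hook Christoph points at is roughly the following. This is a sketch only: the d_automount signature is the one proposed in David Howells' then-pending series, and btrfs_mount_subvol() is a hypothetical helper, not existing btrfs code.

/* Rough sketch of a subvolume crossing using the d_automount dentry
 * operation from David Howells' series.  btrfs_mount_subvol() is a
 * hypothetical helper that would build (or look up) a vfsmount whose
 * root is the subvolume's own root dentry. */
#include <linux/dcache.h>
#include <linux/mount.h>
#include <linux/path.h>

static struct vfsmount *btrfs_mount_subvol(struct vfsmount *mnt,
                                           struct dentry *dentry); /* hypothetical */

static struct vfsmount *btrfs_d_automount(struct path *path)
{
        /* Hand the VFS a real vfsmount for the subvolume, so NFS, Samba and
         * userspace see a genuine mount boundary instead of dentry trickery. */
        return btrfs_mount_subvol(path->mnt, path->dentry);
}

static const struct dentry_operations btrfs_subvol_dentry_ops = {
        .d_automount = btrfs_d_automount,
};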
On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It's just as stable as a real dev_t is in the times of hotplug and udev. As long as you don't touch anything, including not upgrading the kernel, it will remain stable; otherwise it will break. That's why modern nfs-utils defaults to using the uuid-based filehandle schemes instead of the dev_t based ones.

At least that's what I'm told - I really hope it's using the real UUIDs from the filesystem and not the horrible fsid hack that was once added - for some filesystems like XFS that field does not actually have any relation to the UUID historically. And while we could have changed that, it's too late now that nfs was hacked into abusing that field.
On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.

It's a field that doesn't have any useful specification and basically contains random garbage that a filesystem puts into it. Using it is a very bad idea.
On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It's just as stable as a real dev_t is in the times of hotplug and udev.
> [...] That's why modern nfs-utils defaults to using the uuid-based
> filehandle schemes instead of the dev_t based ones. At least that's
> what I'm told - I really hope it's using the real UUIDs from the
> filesystem and not the horrible fsid hack that was once added [...]

IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but they won't fit into the NFSv2 32-byte filehandles, so there is an '8-byte fsid' and '4-byte fsid + inode number' workaround for that...

See the mk_fsid() helper in fs/nfsd/nfsfh.h

Cheers,

Trond

--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Tue, Dec 07, 2010 at 05:52:13PM +0100, hch wrote:
> On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
>
> It's a field that doesn't have any useful specification and basically
> contains random garbage that a filesystem puts into it. Using it is a
> very bad idea.

I meant the above statement to apply only to btrfs; and nfs-utils is using fs_fsid only in the case where the filesystem type is "btrfs". So I believe the current code does work.

But I agree that constructing filehandles differently based on a strcmp() of the filesystem type is not a sustainable design, to say the least.

--b.
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > Any chance we can add a ->get_fsid(sb, inode) method to
> > export_operations (or something similar), that allows the filesystem to
> > generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.
>
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being exported at the same time, they would need to have different FSID values, hence the inode parameter to allow generating an FSID that is a function of both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas
On 2010-12-07, at 10:02, Trond Myklebust wrote:
> On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> > [...]
>
> IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> they won't fit into the NFSv2 32-byte filehandles, so there is an
> '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
>
> See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Cheers, Andreas
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> On 2010-12-07, at 10:02, Trond Myklebust wrote:
> > [...]
>
> It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid by default (in utils/mountd/cache.c:uuid_by_path()).

> There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.
On 2010-12-08, at 10:27, J. Bruce Fields wrote:
> On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> > It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).
>
> No, if you look at the nfs-utils source you'll find mountd sets a uuid
> by default (in utils/mountd/cache.c:uuid_by_path()).

Unfortunately, this only works for block devices, not network filesystems.

> > There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.
>
> Agreed that doing this in the kernel would probably be simpler.

Agreed.

Cheers, Andreas
On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <bfields@redhat.com> wrote:
> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > > We're using statfs64.fs_fsid for this; I believe that's both stable
> > > across reboots and distinguishes between subvolumes, so that's OK.
> > >
> > > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > > on an explicit check for a filesystem type of "btrfs", which is
> > > awful--btrfs won't always be the only filesystem that wants to do this
> > > kind of thing, etc.)
> >
> > Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> > (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.

My standard objection here is that you cannot guarantee that the fsid is unique across all filesystems in the system (including filesystems mounted from dm snapshots of filesystems that are currently mounted). NFSd needs this uniqueness.

This is only really an objection if user-space cannot override the fsid provided by the filesystem.

I'd be very happy to see an interface to user-space whereby user-space can get a reasonably unique fsid for a given filesystem. Whether this is an export_operations method or some field in 'struct super' which gets copied out doesn't matter to me.

NeilBrown
On 2010-12-08, at 16:07, Neil Brown wrote:
> On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <bfields@redhat.com> wrote:
> > On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > > Any chance we can add a ->get_fsid(sb, inode) method to
> > > export_operations (or something similar), that allows the
> > > filesystem to generate an FSID based on the volume and
> > > inode that is being exported?
> >
> > No objection from here.
>
> My standard objection here is that you cannot guarantee that the
> fsid is unique across all filesystems in the system (including
> filesystems mounted from dm snapshots of filesystems that are
> currently mounted). NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value...

> This is only really an objection if user-space cannot override
> the fsid provided by the filesystem.

Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding "btrfs" filesystems).

> I'd be very happy to see an interface to user-space whereby
> user-space can get a reasonably unique fsid for a given
> filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.

> Whether this is an export_operations method or some field in
> 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas
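The gist of the "Copy fs UUID to superblock" patches is tiny. Roughly -- as a sketch of the idea described above, not the literal patch text, and assuming the s_uuid field those patches add to struct super_block -- at mount time the filesystem mirrors its on-disk UUID into the generic superblock so generic code can use it:

/* Sketch of the idea behind the "Copy fs UUID to superblock" patches, not
 * the literal patch: at mount time, copy the filesystem's on-disk UUID
 * into the generic super_block so generic code (nfsd, mountinfo) can see
 * it without filesystem-specific knowledge. */
#include <linux/fs.h>
#include <linux/string.h>

static void example_copy_fs_uuid(struct super_block *sb,
                                 const unsigned char *ondisk_uuid)
{
        /* s_uuid is the 16-byte field added by the patchset referenced above */
        memcpy(sb->s_uuid, ondisk_uuid, sizeof(sb->s_uuid));
}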
On Wed, Dec 08, 2010 at 09:41:33PM -0700, Andreas Dilger wrote:
> On 2010-12-08, at 16:07, Neil Brown wrote:
> > [...]
> >
> > I'd be very happy to see an interface to user-space whereby
> > user-space can get a reasonably unique fsid for a given
> > filesystem.
>
> Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

Well, the kernel can't distinguish the case of an explicit "uuid=" option in /etc/exports from one that was (as is the normal default) generated automatically by mountd. Maybe not a big deal.

The uuid seems like a useful thing to have access to from userspace anyway, for userspace nfs servers if for no other reason:

> That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in
> sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.
>
> Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):
>
> [PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
> [PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
> [PATCH -V22 14/14] ext4: Copy fs UUID to superblock

I can't see anything wrong with that.

--b.
On Wednesday 01 December 2010, Mike Hommey wrote:
> On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <josef@redhat.com> wrote:
> > > > [original design write-up snipped]
> > >
> > > could it be possible to convert a directory into a volume? or at
> > > least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new
> > volume and copying all the files/dirs in by hand.
>
> Except you wouldn't have to copy data, only metadata.

And it could probably be race-free. If I cp --reflink or rsync stuff from a real directory to a subvolume and then rename the old directory to another name and the subvolume to the directory name, then I might be missing files that have been created during the copy process and missing changes to files that have already been copied.

What I would like is an easy way to make ~/.kde or whatever a subvolume, to be able to snapshot it independently while KDE applications or whatever are using and writing to it, *without* any userland even noticing it and without any additional space consumption (except for the metadata needed to manage the subvolume).

So

deepdance:/#12> btrfs subvolume create /home/martin/.kde
ERROR: '/home/martin/.kde' exists

would just make a subvolume out of ~/.kde, even if it needs splitting out the tree or even copying the tree data into a new tree. There are other filesystem operations, like btrfs filesystem balance, that can be expensive as well.

All that said from a user point of view. Maybe technically it's not feasible. But it would be nice if it can be made feasible without losing existing advantages.

And maybe

deepdance:/> btrfs subvolume create .
ERROR: '.' exists

should really remain this way ;).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Thu, 2010-12-02 at 10:49 +0100, Arne Jansen wrote:
> Josef Bacik wrote:
> > [proposals 1) and 2) from the design write-up snipped]
>
> What about the alternative and allocating inode numbers globally? The only
> problem would be with snapshots as they share the inum with the source, but
> one could just remap inode numbers in snapshots by sparing some bits at the
> top of this 64 bit field.
>
> Having one mount per subvolume/snapshot is the cleaner solution, but
> quickly leads to situations where you have _lots_ of mounts, especially when
> you export them via NFS and mount it somewhere else. I've seen a machine
> which had to handle > 100,000 mounts from a zfs server. This definitely
> brings its own problems, so I'd love to see a full fs exported as a single
> mount. This will also keep output from tools like iostat (for nfs mounts)
> and df readable.

Having a lot of mounts will be a problem when the mount table is exposed directly from the kernel, something that must be done, and is being done in the latest util-linux.

Ian
On Mon, 2010-12-06 at 09:27 -0500, Josef Bacik wrote:
> On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> > On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> > > [original design write-up and stat output snipped]
> > >
> > > So as far as the user is concerned, everything should come out right. Obviously
> > > we had to do the NFS trickery still because as far as VFS is concerned the
> > > subvolumes are all on the same mount. So the question is this (and really this
> > > is directed at Christoph and Bruce and anybody else who may care), is this good
> > > enough, or do we want to have a separate vfsmount for each subvolume? Thanks,
> >
> > What are the drawbacks of having a vfsmount for each subvolume?
> >
> > Why (besides having to code it up) are you trying to avoid doing it that way?
>
> It's the having to code it up that way thing, I'm nothing if not lazy.

And, anything that uses the mount table, exposed from the kernel, will grind a system to a halt with only a few thousand mounts, not to mention that user space utilities, like df, du ..., will become painful to use for more than a hundred or so entries.