[please send original message to Lustre devel also]

Hi

On 5/27/08 1:56 AM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Johann Lombardi writes:
>> Hi all,
>
> [...]
>
>> * item #2: Supporting quotas with CMD
>>
>> The quota master is the only one having a global overview of the quota usages
>> and limits. On b1_6, the quota master is the MDS and the quota slaves are the
>> OSSs. The code is designed in theory to support several MDT slaves too, but some
>> shortcuts have been taken and some additional work is needed to support an
>> architecture with 1 quota master (one of the MDTs) and several OST/MDT slaves.
>
> From reading the quota HLD it is not clear that the master necessarily has to be
> an MDT server. Given that OSTs are going to have MDT-like recovery in 2.0,
> it seems reasonable to hash uid/gid across all OSTs, which act as masters
> (additionally, it seems logical to handle disk block allocation on OSTs
> rather than MDTs). Or am I missing something here?

Yes - and this being possible was part of the plan originally.

>> * item #3: Supporting quotas with DMU
>>
>> ZFS does not support standard Unix quotas. Instead, it relies on fileset quotas.
>> This is a problem because Lustre quotas are set on a per-uid/gid basis.
>> To support ZFS, we are going to have to put OST objects in a dataset matching a
>> dataset on the MDS.
>> We also have to decide what kind of quota interface we want to have at the
>> Lustre level (do we still set quotas on uid/gid or do we switch to the dataset
>> framework?). Things get more complicated if we want to support an MDS using
>> ldiskfs and OSSs using ZFS (do we have to support this?).
>> IMHO, in the future, Lustre will want to take advantage of the ZFS space
>> reservation feature and since this also relies on datasets, I think we should
>> adopt the ZFS quota framework at the Lustre level too.
>> That being said, my understanding of ZFS quotas is limited to this webpage:
>> http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch05s06.html
>> and I haven't had the time to dig further.
>
> As per discussion with the ZFS team, they are going to implement per-user
> and per-group block quotas in ZFS (inode quotas make little sense for ZFS).

Why do they not need file quota - what if someone wants to control file count?

> Going aside, if I were designing quota from scratch right now, I
> would implement it completely inside of Lustre. All that is needed for
> such an implementation is a set of call-backs that the local file system
> invokes when it allocates/frees blocks (or inodes) for a given
> object. Lustre would use these call-backs to transactionally update
> local quota in its own format. That would save us a lot of hassle we
> have dealing with the changing kernel quota interfaces, uid re-mappings,
> and subtle differences between quota implementations on different file
> systems.

======> IMPORTANT: get in touch with Jeff Bonwick now, let's get quota implemented in this way in DMU then.

> Additionally, this is better aligned with the way we handle access
> control: MDT implements access control checks completely within MDD,
> without relying on the underlying file system.
>
> [...]
>
>> * issue #2: Quota accuracy
>>
>> When a slave runs out of its local quota, it sends an acquire request to the
>> quota master. As I said earlier, the quota master is the only one having a
>> global overview of what has been granted to slaves. If the master can satisfy
>> the request, it grants a qunit (can be a number of blocks or inodes) to the
>> slave. The problem is that an OST can return "quota exceeded" (=EDQUOT) whereas
>> another OST still has quota. There is currently no callback to claim
>> back the quota space that has been granted to a slave.

Hmm - the slave should release quota. Arguing about the last qunit of quota is pointless; we are NOT interested in quotas that are absurdly small, I think - they would interfere with performance, and the whole architecture of quota uses qunits to leave performance unaffected.

> What strikes me in this description is how this is similar to DLM. It
> almost looks like quota can be easily implemented as a special type of
> lock, and the DLM conflict resolution mechanism with cancellation ASTs can
> be used to reclaim quota.

The slave should release. That doesn't address the issue of all OSTs consistently reporting EDQUOT. However, doing that in a persistent way may have its own troubles, namely how it would be released, and recovery after a power-off on servers. If not persistent, it would be pointless, because after a server reboot, OSSs with space left would still give the wrong answer.

Peter

>> [...]
>>
>> Cheers,
>> Johann
>
> Nikita.
Hello Peter,

On Tue, May 27, 2008 at 07:28:10AM +0800, Peter Braam wrote:
> >> When a slave runs out of its local quota, it sends an acquire request to the
> >> quota master. As I said earlier, the quota master is the only one having a
> >> global overview of what has been granted to slaves. If the master can satisfy
> >> the request, it grants a qunit (can be a number of blocks or inodes) to the
> >> slave. The problem is that an OST can return "quota exceeded" (=EDQUOT) whereas
> >> another OST still has quota. There is currently no callback to claim
> >> back the quota space that has been granted to a slave.
>
> Hmm - the slave should release quota.

I don't think that the slave can make such a decision by itself, since it does not know that we are getting closer to the global quota limit. Only the master is aware of this.
Actually, the scenario I described above can no longer happen - with recent Lustre versions at least - thanks to the dynamic qunit patch, because the master broadcasts the new qunit size to all the slaves when it is shrunk.

Cheers,
Johann
On Ter, 2008-05-27 at 07:28 +0800, Peter Braam wrote:
> > Going aside, if I were designing quota from scratch right now, I
> > would implement it completely inside of Lustre. All that is needed for
> > such an implementation is a set of call-backs that the local file system
> > invokes when it allocates/frees blocks (or inodes) for a given
> > object. Lustre would use these call-backs to transactionally update
> > local quota in its own format. That would save us a lot of hassle we
> > have dealing with the changing kernel quota interfaces, uid re-mappings,
> > and subtle differences between quota implementations on different file
> > systems.
>
> ======> IMPORTANT: get in touch with Jeff Bonwick now, let's get quota
> implemented in this way in DMU then.

I think this was proposed by Alex before, but AFAIU the conclusion was that this is not possible to do with ZFS (or at least, not easy to do).

The problem is that ZFS uses delayed allocations, i.e., allocations occur long after a transaction group has been closed, and therefore we can't transactionally keep track of allocated space: by the time the callbacks were called, we would not be allowed to write to the transaction group anymore, since another 2 txgs could have been opened already.

Since this couldn't be done transactionally, if the node crashes there would be no way of knowing how many blocks had been allocated in the latest (actually, the latest 2) committed transaction groups.

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
Ricardo M. Correia writes:
> On Ter, 2008-05-27 at 07:28 +0800, Peter Braam wrote:
>
> > [...]
>
> I think this was proposed by Alex before, but AFAIU the conclusion was
> that this is not possible to do with ZFS (or at least, not easy to do).
>
> The problem is that ZFS uses delayed allocations, i.e., allocations
> occur long after a transaction group has been closed, and therefore we
> can't transactionally keep track of allocated space: by the time
> the callbacks were called, we would not be allowed to write to the transaction
> group anymore, since another 2 txgs could have been opened already.

But that problem has to be solved anyway to implement per-user quotas for ZFS, correct?

One possible solution I see is to use something like the ZIL to log operations in the context of the current transaction group. This log can be replayed during mount to update the quota file.

> Since this couldn't be done transactionally, if the node crashes there
> would be no way of knowing how many blocks had been allocated in the
> latest (actually, the latest 2) committed transaction groups.
>
> Regards,
> Ricardo

Nikita.
On Qua, 2008-05-28 at 18:54 +0400, Nikita Danilov wrote:
> But that problem has to be solved anyway to implement per-user quotas
> for ZFS, correct?

Indeed, but it's probably easier and more reliable to make the DMU itself update an internal quota/space accounting DMU object when a txg is syncing (updating internal objects during txg sync is something that the DMU already does, e.g., for spacemaps) than to allow arbitrary modifications to a transaction group after it has been closed.

> One possible solution I see is to use something like the ZIL to log
> operations in the context of the current transaction group. This log can be
> replayed during mount to update the quota file.

Hmm.. I'm not sure it would be easy to figure out during replay how many blocks were freed, especially considering things like snapshots, clones and deferred frees (if frees are making a txg sync take too long to converge, the DMU will add them to a freelist object instead of freeing them immediately).

I agree that quotas could be implemented in Lustre (independent of the backend filesystem), but IMHO it would make more sense for the space accounting to be done in the DMU itself, due to the complexities associated with its internal behaviour.

Regards,
Ricardo
Peter Braam writes:
> Why do they not need file quota - what if someone wants to control file
> count?

The idea is that support for inode quota is not necessary in the DMU, as the upper layer can implement it by itself. Compare this with block quotas, which do require support from the DMU, because the upper layer cannot generally tell an overwrite from a new write.

Nikita.
Ricardo M. Correia writes:
> Indeed, but it's probably easier and more reliable to make the DMU
> itself update an internal quota/space accounting DMU object when a txg
> is syncing (updating internal objects during txg sync is something that
> the DMU already does, e.g., for spacemaps) than to allow arbitrary
> modifications to a transaction group after it has been closed.

Even doing it internally looks rather involved. The problem, as I understand it, is that no new block can be allocated while a transaction group is in the sync state (?), so the DMU would have to track all users and groups whose quota is affected by the current transaction group, allocate - before closing the group - some kind of on-disk table with an entry for every updated quota, and then fill these entries later when actual disk space is allocated.

Note that the DMU has to know about users and groups to implement quota internally, which looks like a pervasive interface change.

> I agree that quotas could be implemented in Lustre (independent of the
> backend filesystem), but IMHO it would make more sense for the
> space accounting to be done in the DMU itself, due to the complexities
> associated with its internal behaviour.

I absolutely agree that the DMU has to do space _accounting_ internally. The question is how to store the results of this accounting without bothering the DMU with higher-level concepts such as a user or a group identifier.

I think that the utility of the DMU as a universal back-end would improve if it were to export an interface allowing its users to update, with certain restrictions, on-disk state when a transaction group is in sync (i.e., an interface similar to the one that is internally used for spacemaps).

> Regards,
> Ricardo

Nikita.
On Qua, 2008-05-28 at 20:22 +0400, Nikita Danilov wrote:
> Even doing it internally looks rather involved. The problem, as I
> understand it, is that no new block can be allocated while a transaction
> group is in the sync state (?)

I'm not sure if you are describing it incorrectly or just using the same terms for different concepts, but in any case, blocks are allocated *while* the transaction group is syncing, and due to compression and online pool configuration changes it is impossible to know the exact on-disk space a given block will use until the transaction group is actually syncing.

> so the DMU would have to track all users and
> groups whose quota is affected by the current transaction group, allocate -
> before closing the group - some kind of on-disk table with an
> entry for every updated quota, and then fill these entries later when
> actual disk space is allocated.

Yes, that sounds correct.

> Note that the DMU has to know about users and groups to implement quota
> internally, which looks like a pervasive interface change.

No, AFAIK the consensus we reached with the ZFS team is that, since the DMU does not have any concept of users or groups, it will track space usage associated with opaque identifiers, so that when we write to a file we would give it 2 identifiers which, for us, would map one to a user and the other to a group.

> I absolutely agree that the DMU has to do space _accounting_ internally. The
> question is how to store the results of this accounting without bothering
> the DMU with higher-level concepts such as a user or a group identifier.

I really don't think we should allow the consumer to write to a txg which is already in the syncing phase; I think the DMU should store the accounting itself.

> I think that the utility of the DMU as a universal back-end would improve if it
> were to export an interface allowing its users to update, with certain
> restrictions, on-disk state when a transaction group is in sync (i.e., an
> interface similar to the one that is internally used for spacemaps).

Hmm.. I'm not sure that would be very useful - why not write the data when the txg was open in the first place? Maybe you can give a better example?

For things that require knowledge of DMU internals (like space accounting, spacemaps, ...) it shouldn't be the DMU consumer that has to write during the txg sync phase, it should be the DMU, because only the DMU should know about its internals.

The example you have given (spacemaps) is the worst of all, because spacemap updates are rather involved. Due to COW and to the ZIO pipeline design, spacemap modifications lead to a chicken-and-egg problem with transactional updates:

When you modify a space map, you create a ZIO which just before writing leads to an allocation (due to COW). But since you need to do an allocation, you need to change the spacemap again, which leads to another allocation (and also frees the old just-written block), so you need to update the space map again, and so on and on.. (!)

This is why txgs need to converge and why after a few phases the DMU gives up freeing blocks and starts re-using blocks which were freed in the same txg.

Cheers,
Ricardo
Ricardo M. Correia writes:
> On Qua, 2008-05-28 at 20:22 +0400, Nikita Danilov wrote:

[...]

> I'm not sure if you are describing it incorrectly or just using the same
> terms for different concepts, but in any case, blocks are allocated
> *while* the transaction group is syncing, and due to compression and
> online pool configuration changes it is impossible to know the exact
> on-disk space a given block will use until the transaction group is
> actually syncing.

I meant that the table mentioned below cannot grow while the transaction group is syncing, which means that the dmu has to calculate the size of the table in advance.

[...]

> > Note that the DMU has to know about users and groups to implement quota
> > internally, which looks like a pervasive interface change.
>
> No, AFAIK the consensus we reached with the ZFS team is that, since
> the DMU does not have any concept of users or groups, it will track
> space usage associated with opaque identifiers, so that when we write to
> a file we would give it 2 identifiers which, for us, would map one
> to a user and the other to a group.

Well... that's just renaming uid and gid into opaqueid0 and opaqueid1. :-)

So on one hand we have to add a couple of parameters to all dmu entry points that can allocate disk space. On the other hand we have something like

    typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);

    void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);

with the dmu calling the registered call-back when blocks are actually allocated to the object. The advantage of the latter interface is that the dmu implements only the mechanism, and the policy ("user quotas" and "group quotas") is left to the upper layers to implement.

[...]

> I really don't think we should allow the consumer to write to a txg
> which is already in the syncing phase; I think the DMU should store the
> accounting itself.

One important aspect of Lustre quota requirements that wasn't mentioned so far is that Lustre needs something more than -EDQUOT from the file system. For example, to integrate quotas with dirty cache grants, the server has to know how much quota is left; to redistribute quota across OSTs it has to modify quota, etc. If quota management and storage are completely encapsulated within the dmu, then the dmu has to provide a full quota control interface too, and that interface has to be exported from osd upward. For one thing, implementation of this interface is going to take a lot of time.

[...]

> For things that require knowledge of DMU internals (like space
> accounting, spacemaps, ...) it shouldn't be the DMU consumer that has to
> write during the txg sync phase, it should be the DMU, because only the
> DMU should know about its internals.

I don't quite understand this argument. The DMU already has an interface to capture a buffer into a transaction and to modify it within this transaction. An interface to modify a buffer after the transaction was closed, but before it is committed, is no more "internal" than the first one. It just places more restrictions on what the consumer is allowed to do with the buffer.

> When you modify a space map, you create a ZIO which just before writing
> leads to an allocation (due to COW). But since you need to do an
> allocation, you need to change the spacemap again, which leads to
> another allocation (and also frees the old just-written block),
> so you need to update the space map again, and so on and on.. (!)
>
> This is why txgs need to converge and why after a few phases the DMU gives up
> freeing blocks and starts re-using blocks which were freed in the same txg.

Good heavens. :-)

> Cheers,
> Ricardo

Nikita.
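To make the shape of this call-back proposal concrete, here is a minimal consumer-side sketch. It assumes the dmu_alloc_callback_register() interface proposed above (it does not exist in the DMU today), and the quota_lookup_by_object()/quota_mark_dirty() helpers are likewise hypothetical placeholders for whatever structure the upper layer keeps:

    /* Sketch only: dmu_alloc_callback_register() is the interface proposed
     * above, not an existing DMU call; the quota_* helpers are hypothetical.
     * objset_t is assumed to come from the usual DMU headers. */
    typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);
    void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);

    struct quota_entry {
            uint64_t owner_id;   /* opaque id the upper layer maps to a uid/gid */
            int64_t  bytes_used; /* running total maintained by the upper layer */
    };

    /* Invoked by the DMU when blocks are actually allocated to (or freed
     * from) an object; the upper layer applies the delta to its own
     * transactionally updated quota file. */
    static void quota_space_cb(objset_t *os, uint64_t objid, long bytes)
    {
            struct quota_entry *qe = quota_lookup_by_object(os, objid);

            qe->bytes_used += bytes;
            quota_mark_dirty(qe);
    }

    static void quota_attach(objset_t *os)
    {
            dmu_alloc_callback_register(os, quota_space_cb);
    }

With such an interface the DMU only reports allocation events; mapping objects to owners and enforcing limits stays entirely in the upper layer, which is the mechanism/policy split argued for above.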
On Qui, 2008-05-29 at 00:06 +0400, Nikita Danilov wrote:
> Well... that's just renaming uid and gid into opaqueid0 and
> opaqueid1. :-)

It may be simple for POSIX uids/gids, but maybe not for CIFS user ids (though since there is a mapping table it may also be simple, I'm not sure).
I think it would make sense to give it a list of opaque ids, instead of just opaqueid0 and opaqueid1 (maybe a file could belong to multiple groups, or maybe we can think of other creative ideas in the future..).

> So on one hand we have to add a couple of parameters to all dmu entry
> points that can allocate disk space. On the other hand we have something
> like
>
>     typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);
>
>     void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);
>
> with the dmu calling the registered call-back when blocks are actually allocated
> to the object. The advantage of the latter interface is that the dmu implements
> only the mechanism, and the policy ("user quotas" and "group quotas") is left to
> the upper layers to implement.

I don't see why that would be an advantage over what we had planned to do.
The plan we discussed with the ZFS team was to make the DMU do space accounting internally by opaque ids, so the quota policy/enforcement would still be left to the upper layers to implement.

> If quota management and storage are
> completely encapsulated within the dmu, then the dmu has to provide a full quota
> control interface too, and that interface has to be exported from osd
> upward. For one thing, implementation of this interface is going to take
> a lot of time.

Again, the plan was for the DMU to do only space accounting; the actual quota management and enforcement would be implemented in Lustre.

> I don't quite understand this argument. The DMU already has an interface to
> capture a buffer into a transaction and to modify it within this
> transaction. An interface to modify a buffer after the transaction was
> closed, but before it is committed, is no more "internal" than the first
> one. It just places more restrictions on what the consumer is allowed to
> do with the buffer.

What I mean is that IMO a consumer of a filesystem shouldn't have to know intimate details of how the filesystem (in this case, the DMU) works.
For instance, so far Lustre had no idea that transactions are actually grouped into transaction groups, and it had no idea about transaction group states.

Allowing modification of buffers by an upper layer when a transaction group is already syncing is not a very elegant way to solve this IMHO (compared with our previous plan).. :-)

Cheers,
Ricardo
Ricardo M. Correia writes:

[...]

> The plan we discussed with the ZFS team was to make the DMU do space
> accounting internally by opaque ids, so the quota policy/enforcement
> would still be left to the upper layers to implement.

Hm.. it seems I am confused. If ZFS stores the quota table internally, how is this table made available to the upper layers to implement policy?

Nikita.
On Qui, 2008-05-29 at 01:11 +0400, Nikita Danilov wrote:
> Hm.. it seems I am confused. If ZFS stores the quota table internally, how is
> this table made available to the upper layers to implement policy?

uint64_t dmu_get_space_usage(objset, opaque_id)?

--
Ricardo
Ricardo M. Correia writes:
> uint64_t dmu_get_space_usage(objset, opaque_id)?

It is not immediately clear how chown and chgrp can be implemented on top of this. Plus, an interface to iterate over this table is most likely required for quota tools.

To return to the question of why using call-backs notifying about object space allocation changes is preferable to maintaining space allocation per-id in the file system: opaque ids have to be visible not only in the dmu interface, but also in the osd interface, because quota policy is implemented above osd. But

- osd knows nothing about users and groups (it uses capability-based access control), and its interface would have to be expanded to pass through identifiers that it doesn't understand and doesn't use. This looks wrong. osd operates on objects, identified by fids, and it would be much more natural to do space accounting at the object granularity.

- the osd interface is identical on all platforms, so, in effect, the zfs space accounting interface is enforced on ldiskfs (and on all possible future back-ends).

Nikita.
Regardless of how file quota has to be implemented, we do need file quota, just to be clear.

Peter

On 5/28/08 11:24 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Peter Braam writes:
>> Why do they not need file quota - what if someone wants to control file
>> count?
>
> The idea is that support for inode quota is not necessary in the DMU, as the
> upper layer can implement it by itself. Compare this with block quotas,
> which do require support from the DMU, because the upper layer cannot
> generally tell an overwrite from a new write.
>
> Nikita.
Hi Nikita,

On Sex, 2008-05-30 at 20:38 +0400, Nikita Danilov wrote:
> What about the following:
>
> - dmu tracks per-object 'space usage', in addition to the usual block
>   count as reported by st_blocks.

Currently, the space reported by st_blocks is calculated from dnode_phys_t->dn_used, which in recent SPA versions tracks the number of allocated bytes (not blocks) of a DMU object, which is accurate up to the last committed txg.
Is this what you mean by "space usage"?

> - when space is actually allocated during transaction sync, dmu
>   notifies its user about changes in space usage by invoking some
>
>       void (*space_usage)(objset_t *os, __u64 objid, __s64 delta);
>
>   call-back, registered by the user.

Ok.

> - user updates its data-structures in the context of the currently
>   open transaction.

Ok.

> - dmu internally updates space usage information in the context of
>   the transaction being synced.

This is being done per-object already.

> - it also records a list (let's call this the "pending list") of all
>   objects whose space allocation changed in the context of the same
>   transaction.

Ok, this is where I am starting not to like it.. :)

> - after a mount, dmu calls ->space_usage() against all objects in
>   the pending lists of the last committed transaction group, to update
>   the client's data-structures that are possibly stale due to the loss
>   of the next transaction group.

What do you mean by mount? Do you mean when starting an OST?

> Do you think that might work?

If I understood correctly, the pending list you propose sounds like a recovery mechanism (similar to a log), which I don't think is the right way to implement this.

First of all, I think you would need to keep track of objects changed in the last 2 synced transaction groups, not just the last one. The reason is that when the DMU is syncing transaction group N, it is likely that you can only be writing to transaction group N+2, because transaction group N+1 may already be quiescing. This presents a challenge because if the machine crashes, you may lose data in 2 transaction groups, not just 1, which I think would make things harder to recover..

Another problem is this: let's say the DMU is syncing a transaction group, and starts calling ->space_usage() for objects. Now the machine crashes, and comes up again.
Now how do you distinguish which objects had ->space_usage() called in the transaction group that was syncing and which didn't (or how would you differentiate between ->space_usage() calls of txg N and those of txg N+1)?
At a minimum, you would need a txg parameter in ->space_usage(), which again leaks a bit of internal knowledge of how the DMU works outside the DMU (and which we may not assume will always work the same way in future versions).

Another thing that comes to mind is that the pending list is something very problem-specific that would only be useful for Lustre, not other consumers, so the ZFS team may object to this.. For example, for implementing uid/gid quotas in ZFS, there is no need for such a mechanism..

And furthermore, I think this kind of recovery could be better implemented using commit callbacks, which is an abstraction already designed for recovery purposes and which is backend-agnostic.

Ok, now stepping outside of the pending list (whose purpose I may not have understood correctly at all :-), I think implementing quotas in ZFS is harder than it may look at first sight. For example, let's say you have 1 MB of quota left. How do you determine how much data you can write before the quota runs out?
This may shock you, but depending on the pool configuration, filesystem properties and object block size, writing 1 MB of file data can take anywhere from exactly 0 bytes to 9.25 MB of allocated space (!!).

Now let's scale this up and imagine you have 1 GB of quota left, and you write 1 GB of data (and you do this sufficiently fast). In the worst-case scenario, you could end up going 8.25 GB over the limit, which goes against any possible wish of having fine-grained quotas.. :-)

BTW, this reminds me that I am almost sure our uOSS grants code is wrong (I have not been assigned as an inspector, so I can't say how bad it is..).

Perhaps I am concentrating too much on correctness.. maybe going over a quota is not too big of a deal; I remember some conversations between Andreas and the ZFS team which implied that not having 100% correctness is not too big of a problem. However, I am not so sure about grants.. :/

Regards,
Ricardo
On Sáb, 2008-05-31 at 16:31 +0100, Ricardo M. Correia wrote:
> This may shock you, but depending on the pool configuration,
> filesystem properties and object block size, writing 1 MB of file data
> can take anywhere from exactly 0 bytes to 9.25 MB of allocated space (!!).

Let me correct that: in the very worst-case scenario, writing 1 MB of file data can actually consume (a bit more than) 11.25 MB of allocated space..

--
Ricardo
Ricardo M. Correia writes:
> Hi Nikita,

Hello,

(I reordered some of the comments below.)

> Currently, the space reported by st_blocks is calculated from
> dnode_phys_t->dn_used, which in recent SPA versions tracks the number of
> allocated bytes (not blocks) of a DMU object, which is accurate up to
> the last committed txg.
> Is this what you mean by "space usage"?

I meant a counter of the bytes or blocks that this object occupies for quota purposes. I specifically don't want to identify 'space usage' with st_blocks, because for modern file systems there is no single way to define what to include in quota: users want quota to be consistent with both df(1) and du(1), and in the presence of features like snapshots this is not generally possible.

> What do you mean by mount? Do you mean when starting an OST?

Yes, OST or MDT.

> First of all, I think you would need to keep track of objects changed in
> the last 2 synced transaction groups, not just the last one. The reason

Indeed, I omitted this for the sake of clarity.

> group N+1 may already be quiescing. This presents a challenge because if
> the machine crashes, you may lose data in 2 transaction groups, not
> just 1, which I think would make things harder to recover..

Won't it be enough to record in the pending list objects from the two last transaction groups, if necessary?

> Another problem is this: let's say the DMU is syncing a transaction
> group, and starts calling ->space_usage() for objects. Now the machine
> crashes, and comes up again.
> Now how do you distinguish which objects had ->space_usage() called
> in the transaction group that was syncing and which didn't
> (or how would you differentiate between ->space_usage() calls of

But we don't have to, if we make ->space_usage() idempotent, i.e., taking an absolute space usage as the last argument, rather than a delta. In that case, the DMU is free to call it multiple times, and the client has to cope with this. (Hmm... I am pretty sure this is what I was thinking about when composing the previous message, but a confusing signed __s64 delta somehow got in, sorry.)

> > - dmu internally updates space usage information in the context of
> >   the transaction being synced.
>
> This is being done per-object already.

Aha, this simplifies the whole story significantly. If the dmu already maintains for every object a space usage counter that is suitable for quota, then the 'pending list' can be maintained by the dmu client, without any additional support from the dmu:

- when (as part of an open transaction) the client does an operation that can potentially modify space usage, it adds the object identifier to the pending list, implemented as a normal dmu object;

- when disk space is actually allocated (the transaction group is in sync mode), the client gets the ->space_usage() call-back as above;

- on a 'mount', the client scans the pending list object, fetches space usage from the dmu, updates the client's internal data-structures, and prunes the pending list.

Of course, again, the pending log has to keep track of objects modified in the last 2 transaction groups. With the help of the commit call-back, even ->space_usage() seems unnecessary, because at commit time the client can scan the pending list (in memory). Heh, it seems that the quota can be implemented completely outside of the dmu.

> And furthermore, I think this kind of recovery could be better
> implemented using commit callbacks, which is an abstraction already
> designed for recovery purposes and which is backend-agnostic.

Sounds interesting, can you elaborate on this?

> Perhaps I am concentrating too much on correctness.. maybe going
> over a quota is not too big of a deal; I remember some conversations
> between Andreas and the ZFS team which implied that not having 100%
> correctness is not too big of a problem. However, I am not so sure
> about grants.. :/

It's my impression too that the agreement was to sacrifice some degree of correctness to simplify the implementation.

> Regards,
> Ricardo

Nikita.
On Sáb, 2008-05-31 at 20:19 +0400, Nikita Danilov wrote:
> I meant a counter of the bytes or blocks that this object occupies for
> quota purposes. I specifically don't want to identify 'space usage' with
> st_blocks, because for modern file systems there is no single way to
> define what to include in quota: users want quota to be consistent
> with both df(1) and du(1), and in the presence of features like snapshots
> this is not generally possible.

I think dnode_phys_t->dn_used can be used for this, because AFAICS it keeps track of allocated space referenced by the active filesystem (in other words, it does not include space which is referenced only by snapshots).
I am assuming snapshots should not have any effect on quotas, right?

> Won't it be enough to record in the pending list objects from the two last
> transaction groups, if necessary?

Hmm.. I think so, but I think we should not rely on this always being 2. I think we should allow the list to have an unbounded size, and let the commit callbacks notify us when an entry can be pruned from the list.

> But we don't have to, if we make ->space_usage() idempotent, i.e., taking
> an absolute space usage as the last argument, rather than a delta. In
> that case, the DMU is free to call it multiple times, and the client has
> to cope with this. (Hmm... I am pretty sure this is what I was thinking
> about when composing the previous message, but a confusing signed __s64
> delta somehow got in, sorry.)

Ok, that clears my previous concern.
But in this case, how do you know how much you need to add to or subtract from a quota when an object changes size?
I am guessing that you'd need to at least write the previous object size as part of the pending list, so that when you're recovering you'd know the delta..

Heh, to me this whole thing sounds quite complicated to get right, but I think it could work (but of course, fine-grained hard quotas are another matter altogether..)..

> And furthermore, I think this kind of recovery could be better
> implemented using commit callbacks, which is an abstraction already
> designed for recovery purposes and which is backend-agnostic.
>
> Sounds interesting, can you elaborate on this?

I was thinking that commit callbacks could be used by the DMU consumer to solve this problem instead of having an internal DMU list/log, and I guess you've already figured out how that could be done :)

Cheers,
Ricardo
Ricardo M. Correia writes:
> I am assuming snapshots should not have any effect on quotas, right?

That's the problem: on the one hand, users want to use quota to control actual disk usage, e.g., to not run out of disk space. For this, one wants to include snapshots in quota. On the other hand, space in snapshots is not easily reclaimable, which is not very convenient, because the user might be left without a way to clean up space. Note that this situation is possible without snapshots too: another user can hard-link your file from a directory to which you have no access.

> Hmm.. I think so, but I think we should not rely on this always being 2.
> I think we should allow the list to have an unbounded size, and let the
> commit callbacks notify us when an entry can be pruned from the list.

Right.

> > [...]
>
> Ok, that clears my previous concern.
> But in this case, how do you know how much you need to add to or subtract
> from a quota when an object changes size?
> I am guessing that you'd need to at least write the previous object size
> as part of the pending list, so that when you're recovering you'd know
> the delta..

Yes, something similar to what databases do for redo logging, where the same log entry can be replayed multiple times.

> Cheers,
> Ricardo

Nikita.
Jeff -

Could you get in touch with Nikita and Ricardo and assist them with a draft of a quota design for the DMU? Nikita has some interesting API proposals, but there are some pretty deep ZFS issues involved where help would be welcome, as far as I can see.

Just as a heads-up, quota in systems like Lustre is quite a difficult issue, as many servers contribute to quota usage and this needs "acquire" and "release" of quota in reasonable chunks to avoid the server-server protocol getting too chatty.

Thank you for your help!

Peter

On 5/28/08 10:54 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Ricardo M. Correia writes:
>
> [...]
>
> But that problem has to be solved anyway to implement per-user quotas
> for ZFS, correct?
>
> One possible solution I see is to use something like the ZIL to log
> operations in the context of the current transaction group. This log can be
> replayed during mount to update the quota file.
>
> [...]
>
> Nikita.
I am quite worried about the dynamic qunit patch. I am not convinced I want smaller qunits to stick around.

Please PROVE RIGOROUSLY that qunits grow large again quickly, otherwise they create too much server-server overhead. The cost of 100MB of disk space is barely more than a cent now; what are we trying to address with tiny qunits?

Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot handle more than 15,000 RPCs / sec.

No arguing or opinions here - numbers, please.

The original design I did 4 years ago limited quota calls from one OSS to the master to one per second. Qunits were made adaptive without solid reasoning or design.

Peter

On 5/28/08 4:06 PM, "Johann Lombardi" <johann at sun.com> wrote:

> Hello Peter,
>
> On Tue, May 27, 2008 at 07:28:10AM +0800, Peter Braam wrote:
>
>> [...]
>
> I don't think that the slave can make such a decision by itself, since it does
> not know that we are getting closer to the global quota limit. Only the master
> is aware of this.
> Actually, the scenario I described above can no longer happen - with recent
> Lustre versions at least - thanks to the dynamic qunit patch, because the
> master broadcasts the new qunit size to all the slaves when it is shrunk.
>
> Cheers,
> Johann
For CIFS quota and user ids, get in touch with Matt Wu and Mike Shapiro. There are special interfaces in ZFS/DMU for better Windows support that we probably need to leverage.

Peter

On 5/29/08 5:07 AM, "Ricardo M. Correia" <Ricardo.M.Correia at Sun.COM> wrote:

> [...]
That the total number of quota RPCs made by all OSSs (up to 5000 of them) should not exceed something like 1/3 of this, based on today's hardware, otherwise quota traffic will slow the cluster down. We have seen big servers do 15,000 updates/sec sustained, so that is a measure we can use at the moment.

Peter

On 6/1/08 10:41 AM, "Robert Gordon" <Robert.Gordon at Sun.COM> wrote:

> On May 31, 2008, at 9:32 PM, Peter Braam wrote:
>> A server cannot handle more than 15,000 RPCs / sec.
>
> I'm curious (always ;) ) ..
>
> What do you exactly mean by this statement?
>
> Thanks,
> Robert..
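As a back-of-the-envelope illustration of the figures quoted in this thread (the numbers are the ones above; the 1/3 budget split is only the rule of thumb just stated, not a measured limit):

    /* Back-of-the-envelope only, using the figures from this thread. */
    #include <stdio.h>

    int main(void)
    {
            const double master_rpc_rate = 15000.0;     /* sustained RPCs/sec one server can handle */
            const double quota_share     = 1.0 / 3.0;   /* rough budget for quota traffic           */
            const double num_oss         = 5000.0;      /* OSS count to plan for                    */

            double quota_budget = master_rpc_rate * quota_share; /* ~5,000 quota RPCs/sec in total */
            double per_oss      = quota_budget / num_oss;        /* ~1 quota RPC/sec per OSS       */

            printf("quota budget: %.0f RPC/s total, %.1f RPC/s per OSS\n",
                   quota_budget, per_oss);
            return 0;
    }

That lands at roughly one quota RPC per OSS per second, which is the same order as the one-call-per-second limit from the original design mentioned earlier in the thread.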
On Sun, Jun 01, 2008 at 10:36:33AM +0800, Peter Braam wrote:
> For CIFS quota and user ids, get in touch with Matt Wu and Mike Shapiro.
> There are special interfaces in ZFS/DMU for better Windows support that we
> probably need to leverage.
>
> Peter

For the design doc of how we unified POSIX and Windows SIDs in the underlying filesystem and kernel, see:

opensolaris.org/os/community/arc/caselog/2007/064/final-materials/spec-txt/

The FUID system described there is implemented in ZFS now. If you are doing work to add user accounting to ZFS/DMU (which is wonderful news), all of that should be implemented using the FUID tables, which will result in it working transparently for both POSIX and Windows IDs. Any upstack expressions of this that are meant to be persistent for customers (i.e. a report) or sent over the wire should express the accounted-for entities using the generic string form of SIDs that I describe in there.

-Mike

--
Mike Shapiro, Sun Microsystems Fishworks.            blogs.sun.com/mws/
I'd suggest working with Matt Ahrens on this.

Jeff

On Sun, Jun 01, 2008 at 10:26:41AM +0800, Peter Braam wrote:
> Jeff -
>
> Could you get in touch with Nikita and Ricardo and assist them with a draft
> of a quota design for the DMU? Nikita has some interesting API proposals, but
> there are some pretty deep ZFS issues involved where help would be welcome,
> as far as I can see.
>
> Just as a heads-up, quota in systems like Lustre is quite a difficult issue,
> as many servers contribute to quota usage and this needs "acquire" and
> "release" of quota in reasonable chunks to avoid the server-server protocol
> getting too chatty.
>
> Thank you for your help!
>
> Peter
>
> [...]
Jeff Bonwick writes:
> I'd suggest working with Matt Ahrens on this.

Hello,

we were discussing recently what is needed from the DMU to implement quotas and other forms of space accounting. Our basic premise is that it is desirable to keep the DMU part of the quota support to a minimum, and to implement only mechanism here, leaving policy to the upper layers. Questions of what exactly constitutes disk space usage for quota purposes (e.g., whether and how snapshots have to be accounted for, etc.) are orthogonal to the present interface discussion; it's assumed that dnode_phys_t::dn_used can be used for this.

The two main use cases are

- user and group quotas in ZPL, and

- distributed cluster-wide quota in Lustre and pNFS.

Surprisingly, it seems that these use cases can be implemented without any additional support from the DMU, except for the commit call-back, which is needed for other purposes too.

The general idea is that the DMU user (ZPL or the Lustre MDD module) maintains its own database mapping quota consumers (user and group identifiers) to their current space usage. A mechanism is needed to keep this database up to date. To this end, the user keeps in memory a list of all DMU objects to which space was allocated or deallocated (the "pending list"). The user can do this, provided it has full control of the DMU (which holds for both use cases above). See below on how the pending list is truncated. For each object, the pending list also records its "current space usage" (dnode_phys_t::dn_used), and there is at most one record for a given object in this list.

When a transaction group is about to be closed, the user finds all objects in the pending list belonging to this or previous transaction groups and, in the context of this transaction group, appends to a special "pending file" a record containing

    (object id, current space usage)

Next, the transaction group is synced, disk space is actually allocated to the objects, and dnode_phys_t::dn_used is modified. When the transaction group has committed, the commit call-back is invoked. In this call-back the user scans the pending list and updates its internal quota database:

    foreach (object, space_usage) in list {
            upd_quota(object->owner, dnode->dn_used - space_usage);
            upd_quota(object->group, dnode->dn_used - space_usage);
            remove_record_from_list();
    }

Updates to the user's quota database are done in the context of the currently open transaction.

When the DMU starts (possibly after a crash), the same loop as above is executed for all records in the pending file (as visible in the last committed transaction group). Note that this loop is idempotent: executing it a second time after the first execution has successfully committed has no effect.

In other words, the idea is to update the quota database on transaction group commit (so that updates to the database go into a transaction group in the "future" w.r.t. the object operations that resulted in the space allocation), and to keep in the pending file a list of objects modified during the last 2 transaction groups, so that this file can be used as a kind of redo log to update the quota database in case of failure.

> Jeff

Nikita.
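To flesh out the commit-time loop above in C, here is a rough sketch. Everything in it is illustrative: pending_entry, upd_quota(), pending_prune() and dmu_object_space_used() are assumed names standing in for whatever the DMU user actually keeps, not existing DMU or Lustre interfaces:

    /* Illustrative only; all names are assumed, not real DMU/Lustre APIs. */
    #include <stdint.h>

    struct pending_entry {
            uint64_t objid;          /* DMU object id                             */
            uint64_t owner, group;   /* opaque ids recorded by the DMU user       */
            uint64_t space_before;   /* dn_used recorded when the txg was closed  */
            struct pending_entry *next;
    };

    /* Assumed helpers provided by the DMU user and its osd layer. */
    extern uint64_t dmu_object_space_used(uint64_t objid);
    extern void upd_quota(uint64_t id, int64_t delta);
    extern void pending_prune(struct pending_entry *pe);

    /* Run from the transaction-group commit call-back. */
    static void quota_commit_cb(struct pending_entry *list)
    {
            struct pending_entry *pe;

            for (pe = list; pe != NULL; pe = pe->next) {
                    /* dn_used now reflects the space actually allocated in sync. */
                    uint64_t now   = dmu_object_space_used(pe->objid);
                    int64_t  delta = (int64_t)(now - pe->space_before);

                    /* Database updates go into the currently open transaction,
                     * i.e. strictly "after" the operations that allocated space. */
                    upd_quota(pe->owner, delta);
                    upd_quota(pe->group, delta);

                    /* Prune the matching record from the on-disk pending file in
                     * the same open transaction, keeping the redo log bounded to
                     * the last two transaction groups. */
                    pending_prune(pe);
            }
    }

The same loop, re-run over the records of the pending file after a crash, performs the redo step described above.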
On Sun, Jun 01, 2008 at 10:32:46AM +0800, Peter Braam wrote:
> I am quite worried about the dynamic qunit patch.
> I am not convinced I want smaller qunits to stick around.
>
> Please PROVE RIGOROUSLY that qunits are grow large quickly again, otherwise
> they create too much server - server overhead.

I've _not_ been involved in the design of the adaptive qunit feature (the DLD
pre-dates my involvement with Sun/CFS), but here is how it basically works:
* if remaining quota space < 4 * #osts * current_qunit, the qunit size is
  divided by 2,
* if remaining quota space > 8 * #osts * current_qunit, the qunit size is
  multiplied by 2.
The initial bunit size (also the maximum value) is the default one (i.e. 128MB).
The "4" and "8" can be tuned through /proc and there is a minimum value for
qunit (by default, 1MB = PTLRPC_MAX_BRW_SIZE for bunit).
Let's consider a cluster with 500 OSTs:
* the initial qunit size for a particular uid/gid is 128MB (unless the quota
  limit is too low)
* when left_quota = 256GB, bunit is shrunk to 64MB
* when left_quota = 128GB, bunit is shrunk to 32MB
* when left_quota = 64GB, bunit is shrunk to 16MB
* when left_quota = 32GB, bunit is shrunk to 8MB
* when left_quota = 16GB, bunit is shrunk to 4MB
* when left_quota = 8GB, bunit is shrunk to 2MB
* when left_quota = 4GB, bunit is shrunk to 1MB
Similarly, bunit is grown when the remaining quota space hits the same
thresholds (a rough sketch of this shrink/grow rule follows after this
message).
The dynamic qunit patch also maintains an accurate accounting of how many
threads are waiting for quota space from the master. Thus, slaves can ask for
more than one qunit at a time in a single DQACQ request.
IMO, the current algorithm/parameters are probably too aggressive and the
correct tuning has not been found yet.

> The cost of 100MB of disk space is barely more than a cent now; what are we
> trying to address with tiny qunits?

Today, a couple of customers are asking for accurate quotas. We should
probably discuss with them to understand their motivations.
From my point of view, the interesting feature is not to support small quota
limits or tiny qunits, but to have the ability to adapt qunits for each
uid/gid depending on how much free quota space remains. We can now increase
qunit significantly without hurting quota accuracy, and performance should
only be impacted when getting closer to the quota limit (that was the original
goal in the DLD). That being said, adaptive qunits can easily be disabled by
setting the minimum qunit size to the default qunit size.

> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to
> 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot
> handle more than 15,000 RPC's / sec.
>
> No arguing, or opinions here, numbers please.

With static qunits: 100TB/s / default_bunit_size ~ 1,000,000 RPCs / sec.
To get below the 15,000 RPCs/s, we should increase bunit to ~6.7GB. If each
OST acquires 1 qunit ahead of time w/o actually using it, we "leak"
6.7GB * 5,000 OSTs = 33.5TB.
With adaptive qunits, we can set the default bunit to a larger value (e.g.
10GB) and the minimum bunit to 100MB. This way, quotas can remain "accurate"
(maximum leak is 500GB) and performance would be impacted (more RPCs sent)
only when getting close to the quota limit. However, the current
shrink/enlarge algorithm is definitely not suitable for such a big cluster
since it decreases qunit too quickly.

> The original design I did 4 years ago limited quota calls from one OSS to
> the master to one per second.
> Qunits were made adaptive without solid reasoning or design.

IMHO, adaptive qunits are not such a bad feature, even if there is definitely
room for improvement.

Johann
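For illustration only, the shrink/grow rule described above could be written as the following C helper; the function and parameter names are invented and the real b1_6 lquota code is organized differently:

    #include <stdint.h>

    /* apply "shrink below 4 * #osts * qunit, grow above 8 * #osts * qunit",
     * clamped to the configured minimum and maximum (default) qunit sizes */
    static uint64_t qunit_recalc(uint64_t left_quota, uint64_t nr_osts,
                                 uint64_t qunit, uint64_t qunit_min,
                                 uint64_t qunit_max)
    {
            while (qunit > qunit_min && left_quota <= 4 * nr_osts * qunit)
                    qunit /= 2;

            while (qunit < qunit_max && left_quota > 8 * nr_osts * qunit)
                    qunit *= 2;

            return qunit;
    }

With these thresholds, a 128MB qunit on a 500-OST cluster starts shrinking once the remaining quota drops to 256GB, matching the table in the message above.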
On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote:> I am quite worried about the dynamic qunit patch. > I am not convinced I want smaller qunits to stick around. > > Please PROVE RIGOROUSLY that qunits are grow large quickly again, otherwise > they create too much server - server overhead. The cost of 100MB of disk > space is barely more than a cent now; what are we trying to address withtiny > qunits? > > Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to > 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot > handle more than 15,000 RPC''s / sec. > > No arguing, or opinions here, numbers please. The original design I did 4 > years ago limited quota calls from one OSS to the master to one per second. > Qunits were made adaptive without solid reasoning or design.Just a note - it isn''t only shrinking of qunits that is possible, but also growth of qunits. I think there was also work done to allow recall of qunits from the servers, but I''m not sure if it was landed into CVS. If we are significantly re-architecting quotas, I''d suggest that we also re-implement grants at the same time and use the DLM to do both of them. This way we can have grant + quota on a per-file basis (quota + grant are given to clients as part of extent lock LVB), and are also able to recall quota + grant. We may not even want to have separate quota+grant, since we track the ownership of files on the OSTs and space allocation is done on a per-file basis. It would be possible, for example, to take a user''s whole quota from the master, split it evenly into "num_osts * 2" chunks at mount time to pass to the OSTs, they further grant it to clients when they request extent locks, and then avoid ALL master->OST quota RPCs unless that user actually got close to exceeding their quota, either granting out some of the remaining "num_osts" qunits or recalling some of the outstanding quota (possibly via lock "conversion" to avoid revoking the quota lock). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
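A rough sketch, under the assumptions in Andreas' mail, of what the mount-time split might look like; ost_grant_quota() and master_reserve() are invented names standing in for whatever DLM-based grant mechanism would actually carry the chunks:

    #include <stdint.h>

    extern void ost_grant_quota(unsigned int ost_idx, uint64_t bytes); /* hypothetical */
    extern void master_reserve(uint64_t bytes);                        /* hypothetical */

    /* split a user's whole quota into num_osts * 2 chunks: one chunk per OST
     * up front, the other half kept by the master for later grants/recalls */
    static void quota_initial_split(uint64_t user_limit, unsigned int num_osts)
    {
            uint64_t chunk = user_limit / ((uint64_t)num_osts * 2);
            unsigned int i;

            for (i = 0; i < num_osts; i++)
                    ost_grant_quota(i, chunk);

            /* the remaining ~num_osts chunks stay with the master and are
             * handed out or recalled only when the user nears the limit */
            master_reserve(user_limit - (uint64_t)num_osts * chunk);
    }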
Nikita Danilov wrote:> Jeff Bonwick writes: > > I''d suggest working with Matt Ahrens on this. > > Hello, > > we were discussing recently what is needed from the DMU to implement quotas > and other forms of space accounting. Our basic premise is that it is desirable > to keep DMU part of the quota support at minimum, and to implement only > mechanism here, leaving policy to the upper layers.I agree with this premise. However, your proposed implementation (especially the asynchronous update mechanism and associated pending file) seems unnecessarily complicated. I would suggest that we simply update a "database" (eg. ZAP object or sparse array) of userid -> space usage from syncing context when the space is allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the problems this presents[*] will be more tractable than the method you outlined. --matt [*] eg, if the DB object is stored in the user''s objset, then updating it in syncing context may be problematic. if it is stored in the MOS, carrying it along when doing snapshot operations will be painful (snapshot, clone, send, recv, rollback, etc).
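As a sketch of the direction Matt suggests (not actual ZFS code; the table and helper below are made up for illustration), the per-uid deltas could be accumulated while blocks are born/killed in syncing context and folded into a uid -> space-used object at the end of the sync:

    #include <stdint.h>

    struct usage_delta {
            uint64_t ud_id;      /* uid or gid */
            int64_t  ud_delta;   /* bytes born (+) or killed (-) in this txg */
    };

    /* called next to dsl_dataset_block_born()/_kill() in syncing context */
    static void usage_note_block(struct usage_delta *tab, int *nr,
                                 uint64_t id, int64_t bytes)
    {
            int i;

            for (i = 0; i < *nr; i++) {
                    if (tab[i].ud_id == id) {
                            tab[i].ud_delta += bytes;
                            return;
                    }
            }
            tab[*nr].ud_id = id;
            tab[*nr].ud_delta = bytes;
            (*nr)++;
    }

    /* at the end of the sync the table would be folded, still in syncing
     * context, into the ZAP (or sparse array) keyed by uid/gid */

The open questions in Matt's footnote — where that object lives and how it follows snapshot operations — are untouched by such a sketch.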
Matthew Ahrens writes: > Nikita Danilov wrote: > > Jeff Bonwick writes: > > > I''d suggest working with Matt Ahrens on this. > > > > Hello, > > > > we were discussing recently what is needed from the DMU to implement quotas > > and other forms of space accounting. Our basic premise is that it is desirable > > to keep DMU part of the quota support at minimum, and to implement only > > mechanism here, leaving policy to the upper layers. > > I agree with this premise. However, your proposed implementation (especially > the asynchronous update mechanism and associated pending file) seems > unnecessarily complicated. > > I would suggest that we simply update a "database" (eg. ZAP object or sparse > array) of userid -> space usage from syncing context when the space is > allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the > problems this presents[*] will be more tractable than the method you outlined. Indeed, this solution is much simpler, and it was considered initially. I see following drawbacks in it: - a notion of a user identifier (or some opaque identifier) has to be introduced in DMU interface. DMU doesn''t interpret these identifiers in any way, except for using them as keys in a space usage database. A set of these identifiers has to be passed to every DMU entry point that might result in space allocation (a set is needed because there are group quotas, and to keep interface more or less generic). - an implementation of chown, chgrp, and distributed quota require DMU user to modify this database. Also, an interface to iterate over this database is most likely needed for things like distributed fsck, and user level quote reporting tools. It seems that it would be quite difficult to encapsulate such a database within DMU. > > --matt > > [*] eg, if the DB object is stored in the user''s objset, then updating it in > syncing context may be problematic. if it is stored in the MOS, carrying it The proposal was to update the database in the context of currently open transaction group. That is, when transaction group T has just committed, commit call-back is invoked and the database is updated in the context of some transaction belonging to transaction group T + 2 (T + 1 being in sync). It is because of this that pending file has to keep track of objects from two last transaction groups. > along when doing snapshot operations will be painful (snapshot, clone, send, > recv, rollback, etc). Nikita.
>-----Original Message----- >From: lustre-devel-bounces at lists.lustre.org >[mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas Dilger >Sent: Tuesday, June 03, 2008 7:25 AM >To: Peter Braam >Cc: Bryon Neitzel; Johann Lombardi; Peter Bojanic; Jessica A. Johnson; Eric >Barton; Nikita Danilov; lustre-devel at lists.lustre.org >Subject: Re: [Lustre-devel] Moving forward on Quotas > >On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote: >> I am quite worried about the dynamic qunit patch. >> I am not convinced I want smaller qunits to stick around. >> >> Please PROVE RIGOROUSLY that qunits are grow large quickly again, >otherwise >> they create too much server - server overhead. The cost of 100MB of disk >> space is barely more than a cent now; what are we trying to addresswithtiny>> qunits? >> >> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to >> 100TB/sec in I/O. Calculate quota RPC traffic from that. A servercannot>> handle more than 15,000 RPC''s / sec. >> >> No arguing, or opinions here, numbers please. The original design I did4>> years ago limited quota calls from one OSS to the master to one persecond.>> Qunits were made adaptive without solid reasoning or design. > >Just a note - it isn''t only shrinking of qunits that is possible, but also >growth of qunits. I think there was also work done to allow recall of >qunits from the servers, but I''m not sure if it was landed into CVS.Yes, it has. In order to prevent ping-pong effect, if qunit is reduced, qunit _only_ could be Increased after the_latest_qunit_reduction + lqc_switch_seconds(default is 300 secs) . At designing, we think accuracy is more urgent(otherwise, users will see earlier -EDQUOT), so decreasing can be done any time, but increasing has this limitation. tianzy
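A toy C rendering of the rate limit tianzy describes; only the 300-second default comes from the message above, while the structure and field names are invented:

    #include <time.h>

    struct qunit_ctl {
            unsigned long long qc_qunit;       /* current qunit size */
            time_t             qc_last_shrink; /* when qunit was last reduced */
            int                qc_switch_sec;  /* cool-down, default 300s */
    };

    /* shrinking is allowed at any time; growing is allowed only once the
     * cool-down after the latest reduction has expired, to avoid ping-pong */
    static int qunit_may_grow(const struct qunit_ctl *qc)
    {
            return time(NULL) >= qc->qc_last_shrink + qc->qc_switch_sec;
    }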
Here is some more guidance for thinking about the Lustre quota design: Adaptive qunits are great, but all I see is kind of a hack attempting to get this right instead of a good design. Here are some use cases you need to address, and hopefully address with existing infrastructure. (A) You need callbacks to change it, so that when it shrinks clients can give up quota. (B) mechanisms to recover the correct value if a client reconnects, or master reboots. Starting from a hard coded default value is wrong. If it''s global, then you''d need to store this in the configuration log so that it can be re-read and managed when it changes, using the config log. If it is a per user qunit then we may need an entirely new, similar mechanism. It probably is, and this is what worries me - it''s a huge amount of work to get this right. Doing this is a LOT of work, and unless you do it right the implementation will see a similar pattern of problems with customers as the previous one. So I want to continue to challenge you by asking if there isn''t a quota solution that doesn''t require adaptive behavior, at the expense of small amounts of unmanaged space. Peter On 6/3/08 5:49 PM, "Landen tian" <Zhiyong.Tian at Sun.COM> wrote:>> -----Original Message----- >> From: lustre-devel-bounces at lists.lustre.org >> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas Dilger >> Sent: Tuesday, June 03, 2008 7:25 AM >> To: Peter Braam >> Cc: Bryon Neitzel; Johann Lombardi; Peter Bojanic; Jessica A. Johnson; Eric >> Barton; Nikita Danilov; lustre-devel at lists.lustre.org >> Subject: Re: [Lustre-devel] Moving forward on Quotas >> >> On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote: >>> I am quite worried about the dynamic qunit patch. >>> I am not convinced I want smaller qunits to stick around. >>> >>> Please PROVE RIGOROUSLY that qunits are grow large quickly again, >> otherwise >>> they create too much server - server overhead. The cost of 100MB of disk >>> space is barely more than a cent now; what are we trying to address > withtiny >>> qunits? >>> >>> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to >>> 100TB/sec in I/O. Calculate quota RPC traffic from that. A server > cannot >>> handle more than 15,000 RPC''s / sec. >>> >>> No arguing, or opinions here, numbers please. The original design I did > 4 >>> years ago limited quota calls from one OSS to the master to one per > second. >>> Qunits were made adaptive without solid reasoning or design. >> >> Just a note - it isn''t only shrinking of qunits that is possible, but also >> growth of qunits. I think there was also work done to allow recall of >> qunits from the servers, but I''m not sure if it was landed into CVS. > > Yes, it has. In order to prevent ping-pong effect, if qunit is reduced, > qunit _only_ could be > Increased after the_latest_qunit_reduction + lqc_switch_seconds(default is > 300 secs) . > At designing, we think accuracy is more urgent(otherwise, users will see > earlier -EDQUOT), > so decreasing can be done any time, but increasing has this limitation. > > tianzy > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
>-----Original Message-----
>From: lustre-devel-bounces at lists.lustre.org
>[mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Peter Braam
>Sent: Wednesday, June 04, 2008 9:24 AM
>To: Landen tian; 'Andreas Dilger'
>Cc: 'Bryon Neitzel'; 'Johann Lombardi'; 'Peter Bojanic'; 'Jessica A. Johnson';
>'Eric Barton'; 'Nikita Danilov'; lustre-devel at lists.lustre.org
>Subject: Re: [Lustre-devel] Moving forward on Quotas
>
>Here is some more guidance for thinking about the Lustre quota design:
>
>Adaptive qunits are great, but all I see is kind of a hack attempting to get
>this right instead of a good design. Here are some use cases you need to
>address, and hopefully address with existing infrastructure.
>
>(A) You need callbacks to change it, so that when it shrinks clients can
>give up quota.

All remaining quota on the quota slaves (OSTs etc.) must be kept between
[0.5 qunit, 1.5 qunit]. When a slave receives a shrunken qunit, it checks
whether its remaining local quota still satisfies this limitation; if not, it
releases some quota (a rough sketch of this check follows after this message).
Since we cannot predict which OSTs a user will write to, we give a qunit to
every OST at the beginning so that it is available when quota-related writes
arrive later. Therefore, when we want to shrink qunit, we have to shrink it on
all OSTs.

>
>(B) mechanisms to recover the correct value if a client reconnects, or
>master reboots.
>
>Starting from a hard coded default value is wrong. If it's global, then
>you'd need to store this in the configuration log so that it can be re-read
>and managed when it changes, using the config log.

Whenever a quota request arrives, the quota master (MDS) recalculates the
qunit to decide whether to enlarge or shrink it to a proper value (this
computation is simple and cheap). After rebooting, the MDS knows the proper
qunit as soon as the first quota request has been processed; since the current
qunit is carried in the reply of each quota request, an OST learns the current
qunit once one of its quota requests completes. In this way, whether a client
reconnects or the master reboots, the proper quota information gradually
spreads over the whole cluster.
Currently the qunit information is recorded only in memory, not on disk; after
a reboot or reconnection it is recovered at run time. We could certainly
record it in the config log, which would make reconnection and reboot easier,
but there may be hundreds or thousands of quota users/groups in the system and
we would have to flush this information to the log repeatedly because it is
dynamic. There are gains and losses; please advise whether it is worthwhile.

>
>If it is a per user qunit then we may need an entirely new, similar
>mechanism. It probably is, and this is what worries me - it's a huge amount
>of work to get this right.

Yeah, the adaptive qunit is per user and per group.

>Doing this is a LOT of work, and unless you do it right the implementation
>will see a similar pattern of problems with customers as the previous one.
>
>So I want to continue to challenge you by asking if there isn't a quota
>solution that doesn't require adaptive behavior, at the expense of small
>amounts of unmanaged space.

I guess we need to borrow the mechanism from the ldlm to achieve that, as
Johann said before.

tianzy
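Here is the slave-side check sketched in C under the [0.5 qunit, 1.5 qunit] rule above; dqrel_to_master() is an invented stand-in for the actual release RPC:

    #include <stdint.h>

    extern void dqrel_to_master(uint64_t bytes);   /* hypothetical release RPC */

    /* after the master broadcasts a smaller qunit, drop the locally held,
     * unused quota back into the [0.5 * qunit, 1.5 * qunit] window */
    static void slave_check_qunit(uint64_t local_free, uint64_t new_qunit)
    {
            uint64_t high = new_qunit + new_qunit / 2;    /* 1.5 * qunit */

            if (local_free > high) {
                    /* release whole qunits; the remainder stays above
                     * 0.5 * qunit by construction */
                    uint64_t excess  = local_free - high;
                    uint64_t release = ((excess + new_qunit - 1) / new_qunit) *
                                       new_qunit;

                    dqrel_to_master(release);
            }
    }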
On Mon, Jun 02, 2008 at 05:24:33PM -0600, Andreas Dilger wrote:> Just a note - it isn''t only shrinking of qunits that is possible, but also > growth of qunits. I think there was also work done to allow recall of > qunits from the servers, but I''m not sure if it was landed into CVS.Yes, this is included in the adaptive qunit patch. When qunit is shrunk, the new value is broadcasted to the slaves which release the unused qunits.> If we are significantly re-architecting quotas, I''d suggest that we also > re-implement grants at the same time and use the DLM to do both of them. > This way we can have grant + quota on a per-file basis (quota + grant are > given to clients as part of extent lock LVB), and are also able to > recall quota + grant. We may not even want to have separate quota+grant, > since we track the ownership of files on the OSTs and space allocation > is done on a per-file basis. > > It would be possible, for example, to take a user''s whole quota from > the master, split it evenly into "num_osts * 2" chunks at mount time > to pass to the OSTs, they further grant it to clients when they request > extent locks, and then avoid ALL master->OST quota RPCs unless that user > actually got close to exceeding their quota, either granting out some > of the remaining "num_osts" qunits or recalling some of the outstanding > quota (possibly via lock "conversion" to avoid revoking the quota lock).I''ve been in favor of such an architecture too (see my previous emails). The only "problem" is that it requires quite a lot of work. Johann
Nikita Danilov wrote:> Matthew Ahrens writes: > > Nikita Danilov wrote: > > > Jeff Bonwick writes: > > > > I''d suggest working with Matt Ahrens on this. > > > > > > Hello, > > > > > > we were discussing recently what is needed from the DMU to implement quotas > > > and other forms of space accounting. Our basic premise is that it is desirable > > > to keep DMU part of the quota support at minimum, and to implement only > > > mechanism here, leaving policy to the upper layers. > > > > I agree with this premise. However, your proposed implementation (especially > > the asynchronous update mechanism and associated pending file) seems > > unnecessarily complicated. > > > > I would suggest that we simply update a "database" (eg. ZAP object or sparse > > array) of userid -> space usage from syncing context when the space is > > allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the > > problems this presents[*] will be more tractable than the method you outlined. > > Indeed, this solution is much simpler, and it was considered > initially. I see following drawbacks in it:Agreed, those are possible drawbacks, depending on the implementation. For example, if the DB object is stored in the user''s objset (which is preferable for other reasons) then I suspect that the two drawbacks you mention below will be no worse than in your proposal. --matt> - a notion of a user identifier (or some opaque identifier) has to > be introduced in DMU interface. DMU doesn''t interpret these > identifiers in any way, except for using them as keys in a space > usage database. A set of these identifiers has to be passed to > every DMU entry point that might result in space allocation (a > set is needed because there are group quotas, and to keep > interface more or less generic). > > - an implementation of chown, chgrp, and distributed quota require > DMU user to modify this database. Also, an interface to iterate > over this database is most likely needed for things like > distributed fsck, and user level quote reporting tools. It seems > that it would be quite difficult to encapsulate such a database > within DMU. > > > > > --matt > > > > [*] eg, if the DB object is stored in the user''s objset, then updating it in > > syncing context may be problematic. if it is stored in the MOS, carrying it > > The proposal was to update the database in the context of currently open > transaction group. That is, when transaction group T has just committed, > commit call-back is invoked and the database is updated in the context > of some transaction belonging to transaction group T + 2 (T + 1 being in > sync). It is because of this that pending file has to keep track of > objects from two last transaction groups. > > > along when doing snapshot operations will be painful (snapshot, clone, send, > > recv, rollback, etc). > > Nikita.
Quotas are determined by the owner of the file, not by the uid of the process.
We have been propagating the owner uid / gid of files from MDS to OSS for this
reason.

I don't understand the issue.

Peter

>
> I've discussed with Fanyong and imo, the uid mapping issue is clearly a
> showstopper for now. On HEAD, the client-side uid/gid packed by GSS is
> translated to a service-side uid/gid. This mapping exists on the MDS, but not
> on the OSSs. The problem is that for quotas, OSSs must be aware of this
> mapping.
> We should find a solution to this problem (not straightforward) which needs to
> be addressed anyway even for DMU support.
>
> In general, I don't think that much of this work will be "throw away".
>
> Cheers,
> Johann
On Thu, Jun 05, 2008 at 04:09:41AM -0700, Peter Braam wrote:> Quota are determined by the owner of the file, not by the uid of the > process.In my understanding, the client is not aware of the server-side uid/gid and the file owner returned to the client is translated too. Please let me know if I misunderstood how the uid/gid mapping feature works.> We have been propagating owners / gid of files from MDS to OSS for this > reason. > > I don''t understand the issue.The problem is that the owner / group is propagated to the OSTs through the client the first time it sends a bulk write rpc (by setting the OBD_MD_FLUID/GID flags) and those identifiers - supplied by the client - must be translated too. Johann
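For reference, the propagation Johann mentions amounts to something like the fragment below on the client write path; OBD_MD_FLUID/OBD_MD_FLGID and the obdo fields are the names used in this thread, but the wrapper itself is a simplified illustration (assuming the usual Lustre headers), not the actual osc code:

    /* assumes the Lustre headers that define struct obdo and the
     * OBD_MD_FL* flags; simplified illustration only */
    static void pack_object_owner(struct obdo *oa, __u32 uid, __u32 gid)
    {
            oa->o_uid    = uid;
            oa->o_gid    = gid;
            oa->o_valid |= OBD_MD_FLUID | OBD_MD_FLGID;
            /* with a remote client these are still client-side ids, which is
             * exactly the translation gap discussed in this thread */
    }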
Well, this protocol hasn''t been published yet; why not include the server side uid / gid then? But more seriously, how is this encoded in such a way that the OST can trust the information - it must be in the capability too? Peter On 6/5/08 5:27 AM, "Johann Lombardi" <johann at sun.com> wrote:> On Thu, Jun 05, 2008 at 04:09:41AM -0700, Peter Braam wrote: >> Quota are determined by the owner of the file, not by the uid of the >> process. > > In my understanding, the client is not aware of the server-side uid/gid and > the file owner returned to the client is translated too. Please let me know > if I misunderstood how the uid/gid mapping feature works. > >> We have been propagating owners / gid of files from MDS to OSS for this >> reason. >> >> I don''t understand the issue. > > The problem is that the owner / group is propagated to the OSTs through the > client the first time it sends a bulk write rpc (by setting the > OBD_MD_FLUID/GID flags) and those identifiers - supplied by the client - > must be translated too. > > Johann > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote:> Well, this protocol hasn''t been published yet; why not include the server > side uid / gid then?I understood from Fanyong that according to the remote client design requirements, we should not allow the remote client to access the server-side uid/gid mapping.> But more seriously, how is this encoded in such a way that the OST can trust > the information - it must be in the capability too?hmm, I don''t see any capability associated with this. Johann
On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote:> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >> Well, this protocol hasn''t been published yet; why not include the server >> side uid / gid then? > > I understood from Fanyong that according to the remote client design > requirements, we should not allow the remote client to access the server-side > uid/gid mapping.You have two choices: break that rule OR let the MDS server do the ownership changes. It''s not complicated, just make a choice.> >> But more seriously, how is this encoded in such a way that the OST can trust >> the information - it must be in the capability too? > > hmm, I don''t see any capability associated with this.We had better find it, otherwise there is a security hole.> > Johann > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Peter Braam ??:> > On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: > > >> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >> >>> Well, this protocol hasn''t been published yet; why not include the server >>> side uid / gid then? >>> >> I understood from Fanyong that according to the remote client design >> requirements, we should not allow the remote client to access the server-side >> uid/gid mapping. >> > > You have two choices: break that rule OR let the MDS server do the ownership > changes. It''s not complicated, just make a choice. > > >>> But more seriously, how is this encoded in such a way that the OST can trust >>> the information - it must be in the capability too? >>> >> hmm, I don''t see any capability associated with this. >> > > We had better find it, otherwise there is a security hole. > > >I think we can divide clients into two sorts: trusted and untrusted. The client reliability is defined by administrator. Remote client should be counted as untrusted one. The best simple way is that: local client is trusted, remote client is untrusted. For the trusted ones, disable capability, OSS set file "uid" and "gid" with "oa.o_uid" and "oa.o_gid". It is the current using means. For the untrusted ones, enable capability, OSS set file "uid" and "gid" which contain in the OSS capability got from MDS when open. With OSS capability enabled will affect performance, and current capability''s design does not contain the consideration for that. We can fix the OSS capability design in the task "o3_se_capa_review" Note: since capability feature is time-consuming, we want to support enforcing capabilities on selected clients (or somewhat MDS/OSS capability should aim at remote client). On he other hand, enforcing capabilities on selected clients can simply the capability interoperability between HEAD and b1_6. If this idea can be approved, then the current remote client uid mapping rules can be unchanged, uid mapping on OST is unnecessary also. Thanks! -- Fan Yong>> Johann >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel >> > > >
Thanks for explaining. The trust concept you outline is definitely not acceptable - we need a capability for all access and modifications done through clients on behalf of a capability. Even changes made by the MDS need to be secured - that can be through a kerberos connection, and again, not through blind trust. Would you please send the capability HLD to me? Thanks! - Peter - On 6/9/08 2:52 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote:> Peter Braam ??: >> >> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >> >> >>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>> >>>> Well, this protocol hasn''t been published yet; why not include the server >>>> side uid / gid then? >>>> >>> I understood from Fanyong that according to the remote client design >>> requirements, we should not allow the remote client to access the >>> server-side >>> uid/gid mapping. >>> >> >> You have two choices: break that rule OR let the MDS server do the ownership >> changes. It''s not complicated, just make a choice. >> >> >>>> But more seriously, how is this encoded in such a way that the OST can >>>> trust >>>> the information - it must be in the capability too? >>>> >>> hmm, I don''t see any capability associated with this. >>> >> >> We had better find it, otherwise there is a security hole. >> >> >> > I think we can divide clients into two sorts: trusted and untrusted. > The client reliability is defined by administrator. Remote client > should be counted as untrusted one. The best simple way is that: > local client is trusted, remote client is untrusted. > > For the trusted ones, disable capability, OSS set file "uid" and "gid" > with "oa.o_uid" and "oa.o_gid". It is the current using means. > For the untrusted ones, enable capability, OSS set file "uid" and "gid" > which contain in the OSS capability got from MDS when open. > With OSS capability enabled will affect performance, and current > capability''s design does not contain the consideration for that. We > can fix the OSS capability design in the task "o3_se_capa_review" > > Note: since capability feature is time-consuming, we want to support > enforcing capabilities on selected clients (or somewhat MDS/OSS > capability should aim at remote client). On he other hand, enforcing > capabilities on selected clients can simply the capability interoperability > between HEAD and b1_6. > > If this idea can be approved, then the current remote client uid mapping > rules can be unchanged, uid mapping on OST is unnecessary also. > > > Thanks! > -- > Fan Yong >>> Johann >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>> >> >> >> >
Peter Braam ??:> Thanks for explaining. > > The trust concept you outline is definitely not acceptable - we need a > capability for all access and modifications done through clients on behalf > of a capability. > > Even changes made by the MDS need to be secured - that can be through a > kerberos connection, and again, not through blind trust. >The client reliability (trusted or untrusted) and the MDS/OSS capabilities are not conflict. They can be combined together as mentioned in draft of new MDS/OSS capabilities HLD. capabilities.lyx is the old HLD capability_hld.lyx is the new HLD which is in patch inspection and fix. Thanks! -- Fan Yong> Would you please send the capability HLD to me? Thanks! > > - Peter - > > > On 6/9/08 2:52 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote: > > >> Peter Braam ??: >> >>> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >>> >>> >>> >>>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>>> >>>> >>>>> Well, this protocol hasn''t been published yet; why not include the server >>>>> side uid / gid then? >>>>> >>>>> >>>> I understood from Fanyong that according to the remote client design >>>> requirements, we should not allow the remote client to access the >>>> server-side >>>> uid/gid mapping. >>>> >>>> >>> You have two choices: break that rule OR let the MDS server do the ownership >>> changes. It''s not complicated, just make a choice. >>> >>> >>> >>>>> But more seriously, how is this encoded in such a way that the OST can >>>>> trust >>>>> the information - it must be in the capability too? >>>>> >>>>> >>>> hmm, I don''t see any capability associated with this. >>>> >>>> >>> We had better find it, otherwise there is a security hole. >>> >>> >>> >>> >> I think we can divide clients into two sorts: trusted and untrusted. >> The client reliability is defined by administrator. Remote client >> should be counted as untrusted one. The best simple way is that: >> local client is trusted, remote client is untrusted. >> >> For the trusted ones, disable capability, OSS set file "uid" and "gid" >> with "oa.o_uid" and "oa.o_gid". It is the current using means. >> For the untrusted ones, enable capability, OSS set file "uid" and "gid" >> which contain in the OSS capability got from MDS when open. >> With OSS capability enabled will affect performance, and current >> capability''s design does not contain the consideration for that. We >> can fix the OSS capability design in the task "o3_se_capa_review" >> >> Note: since capability feature is time-consuming, we want to support >> enforcing capabilities on selected clients (or somewhat MDS/OSS >> capability should aim at remote client). On he other hand, enforcing >> capabilities on selected clients can simply the capability interoperability >> between HEAD and b1_6. >> >> If this idea can be approved, then the current remote client uid mapping >> rules can be unchanged, uid mapping on OST is unnecessary also. >> >> >> Thanks! >> -- >> Fan Yong >> >>>> Johann >>>> _______________________________________________ >>>> Lustre-devel mailing list >>>> Lustre-devel at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>>> >>>> >>> >>> > > >-------------- next part -------------- A non-text attachment was scrubbed... Name: capabilities.lyx Type: application/x-lyx Size: 8268 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-devel/attachments/20080610/c0ee7ff3/attachment-0002.bin -------------- next part -------------- A non-text attachment was scrubbed... 
Peter Braam wrote:
> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote:
>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote:
>>> Well, this protocol hasn't been published yet; why not include the server
>>> side uid / gid then?
>> I understood from Fanyong that according to the remote client design
>> requirements, we should not allow the remote client to access the server-side
>> uid/gid mapping.
>
> You have two choices: break that rule OR let the MDS server do the ownership
> changes. It's not complicated, just make a choice.

1) The MDS changes the file owner of the objects on the OSS at create time.
   This introduces one additional RPC between MDS and OSS for each object,
   which makes object pre-creation meaningless. We probably cannot accept the
   creation performance in that case.
   One improvement: initialize all pre-created objects on the OSS with the
   same uid/gid as the file currently being created. Subsequent creations
   check the pre-created object's uid/gid; if they match the owner's uid/gid,
   no additional setattr to the OSS is needed, otherwise a chown RPC is sent
   to the OSS (a rough sketch of this check follows after this message).
   The best case is equal to the current implementation; the worst case is
   equal to having no such improvement. In creation-performance-sensitive
   workloads (HPC or benchmark tests), there are few uid/gid context switches,
   so this improvement should satisfy most creation performance requirements.
2) Pack the file owner into the OSS capability when writing to the OSS.
   According to our UID mapping rules, we do not want the client (especially
   an untrusted client, e.g. a remote client) to know which server-side
   uid/gid it is mapped to. That is why dynamic UID mapping is used and some
   complex remote ACL operations were implemented.
   So if the real file owner (server-side uid/gid) has to be packed into the
   OSS capability returned to the client, it had better be encrypted to
   respect our UID mapping rules; otherwise most of our earlier effort is
   broken. On the other hand, MDS/OSS capabilities use HMAC-SHA1 to prevent
   them from being modified or fabricated. Signing the MAC is already somewhat
   time-consuming and uid/gid encryption is similar, so performance would
   become worse. Implementing server-side uid/gid encryption and decryption
   also requires designing a key-update mechanism.
   So I do not think this is a good choice.
3) Establish the UID mapping on the OSS as well, just as the MDS does.
   This follows the same principle as on the MDS, which needs lower-layer GSS
   support. Much code could be shared between MDS and OSS for UID mapping.
   It is the most complete solution, with little effect on creation
   performance or the UID mapping rules. The shortcoming is substantial
   changes on the OSS; how much GSS effort that requires has not been well
   estimated yet.

>>> But more seriously, how is this encoded in such a way that the OST can trust
>>> the information - it must be in the capability too?
>> hmm, I don't see any capability associated with this.
>
> We had better find it, otherwise there is a security hole.

The root user's super permission should also be considered for the OSS
capability. On the client side, an open file descriptor can be shared by
different threads, and different threads may have different uids. Consider
the following case: the root user writes a file through a file descriptor
opened by another, normal user (Tom). The root user can break through Tom's
quota limitation; he should tell the OSS that he is root, but how can the OSS
believe him? Such information should be contained in the OSS capability.
(Note: on Lustre 1.6, llite sets "OBD_BRW_NOQUOTA" directly for this, which
can be fabricated by a malicious client.)

Best Regards!
--
Fan Yong

> Johann
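A small sketch of the owner check behind option (1); every name below (ost_object, ost_setattr_owner, mds_fix_object_owner) is hypothetical:

    #include <sys/types.h>

    struct ost_object {
            uid_t oo_uid;     /* owner of the pre-created object */
            gid_t oo_gid;
    };

    extern int ost_setattr_owner(struct ost_object *obj, uid_t uid, gid_t gid);

    /* only pay the extra MDS->OSS RPC when the pre-created object's owner
     * differs from the new file's owner */
    static int mds_fix_object_owner(struct ost_object *obj, uid_t uid, gid_t gid)
    {
            if (obj->oo_uid == uid && obj->oo_gid == gid)
                    return 0;                        /* best case: no RPC */

            return ost_setattr_owner(obj, uid, gid); /* worst case: one chown RPC */
    }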
I did not read the entire long message, but clearly the importance of authorizing operations on the OSS far exceeds that of exposing a uid to a client. Moreover, it doesn''t have to be exposed. If the capability contains the uid it can be encrypted with a key so that only the OSS can see it. This seems the preferred solution. Peter On 6/10/08 7:54 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote:> Peter Braam ??: >> >> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >> >> >>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>> >>>> Well, this protocol hasn''t been published yet; why not include the server >>>> side uid / gid then? >>>> >>> I understood from Fanyong that according to the remote client design >>> requirements, we should not allow the remote client to access the >>> server-side >>> uid/gid mapping. >>> >> >> You have two choices: break that rule OR let the MDS server do the ownership >> changes. It''s not complicated, just make a choice. >> > 1) MDS changes file owner for objects on OSS when create. > It introduces one additional RPC between MDS and OSS > for each objects, which causes the per-create meaningless. > Maybe we can not accept the creation performance undre > such case. > An improvement for that is that we can initialize all > the pre-create objects on OSS with the same uid/gid as > the current created file. And the succedent creations > check the per-created object''s uid/gid, if they match > the owner''s uid/gid, then the additional setattr to OSS > is unnecessary, otherwise, chown to OSS will be sent. > The best case is equal to the current implementation; > the worst case is equal to no such improvement. > Under creation performance sensitivity case, maybe HPC > or some benchmark test. There are few uid/gid context > switch. So I think such improvement can match most > creation performance requirement. > 2) Pack file owner into OSS capability when write to OSS. > According to our UID mapping rules, we do not want the > client (especially the untrusted client, maybe remote > client) to know which the server-side uid/gid are mapped. > For that, dynamic UID mapping are used, and some complex > remote acl operations are implemented. > So if file real owner (server-side uid/gid) need to be > packed in OSS capability back to client, bad better to > encrypt them to match our UID mapping rules. Otherwise, > most of our effort for that before are broken. On the > other hand, MDS/OSS capabilities use HMAC-SHA1 to prevent > from being modified or fabricated. Signing MAC is somewhat > time-consuming, and uid/gid encryption are similar, so the > performance will become worse. To implement server-side > uid/gid encryption and decryption, need to design some > key-update mechanism also. > So I do not think it is a good choice. > 3) Establish UID mapping on OSS also, just like MDS does. > It has the same principle as done on MDS which needs > lower layer GSS support. Many codes can be shared > between MDS and OSS for UID mapping. It is the most > complete solution, little affect on creation performance > and UID mapping rules. The shortcoming is much changes > on OSS, how much GSS effort for that is not well estimated > yet. >> >>>> But more seriously, how is this encoded in such a way that the OST can >>>> trust >>>> the information - it must be in the capability too? >>>> >>> hmm, I don''t see any capability associated with this. >>> >> >> We had better find it, otherwise there is a security hole. 
>> >> > Root user''s super permission should be considered for OSS > capability. On client-side, the opened file descriptor can > be shared by different threads, and the different threads > maybe have different "uid". Consider the following case: > Root user writes a file through other normal user (Tom) > opened file descriptor. The root user can breaks through > Tom''s quota limitation, he should tell OSS that he is a > root user, but how can the OSS believe him? Such information > should be contained in the OSS capability. > (Note: on lustre 1.6, llite set "OBD_BRW_NOQUOTA" directly > for that, which can be fabricated by baleful client) > > Best Regards! > -- > Fan Yong >> >>> Johann >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>> >> >> >> > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel