[please send original message to Lustre devel also]

Hi

On 5/27/08 1:56 AM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Johann Lombardi writes:
>> Hi all,
>
> [...]
>
>> * item #2: Supporting quotas with CMD
>>
>> The quota master is the only one having a global overview of the quota usages
>> and limits. On b1_6, the quota master is the MDS and the quota slaves are the
>> OSSs. The code is designed in theory to support several MDT slaves too, but some
>> shortcuts have been taken and some additional work is needed to support an
>> architecture with 1 quota master (one of the MDTs) and several OST/MDT slaves.
>
> From reading the quota HLD it is not clear that the master necessarily has to be
> an MDT server. Given that OSTs are going to have MDT-like recovery in 2.0,
> it seems reasonable to hash uid/gid across all OSTs, which act as masters
> (additionally, it seems logical to handle disk block allocation on OSTs
> rather than MDTs). Or am I missing something here?

Yes - and this being possible was part of the plan originally.

>> * item #3: Supporting quotas with DMU
>>
>> ZFS does not support standard Unix quotas. Instead, it relies on fileset quotas.
>> This is a problem because Lustre quotas are set on a per-uid/gid basis.
>> To support ZFS, we are going to have to put OST objects in a dataset matching a
>> dataset on the MDS.
>> We also have to decide what kind of quota interface we want to have at the
>> Lustre level (do we still set quotas on uid/gid or do we switch to the dataset
>> framework?). Things get more complicated if we want to support an MDS using
>> ldiskfs and OSSs using ZFS (do we have to support this?).
>> IMHO, in the future, Lustre will want to take advantage of the ZFS space
>> reservation feature and since this also relies on datasets, I think we should
>> adopt the ZFS quota framework at the Lustre level too.
>> That being said, my understanding of ZFS quotas is limited to this webpage:
>> http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch05s06.html
>> and I haven't had the time to dig further.
>
> As per discussion with the ZFS team, they are going to implement per-user
> and per-group block quotas in ZFS (inode quotas make little sense for ZFS).

Why do they not need file quota - what if someone wants to control file count?

> Going aside, if I were designing quota from scratch right now, I
> would implement it completely inside of Lustre. All that is needed for
> such an implementation is a set of call-backs that the local file system
> invokes when it allocates/frees blocks (or inodes) for a given
> object. Lustre would use these call-backs to transactionally update
> local quota in its own format. That would save us a lot of hassle we
> have dealing with the changing kernel quota interfaces, uid re-mappings,
> and subtle differences between quota implementations on different file
> systems.

======> IMPORTANT: get in touch with Jeff Bonwick now, let's get quota implemented in this way in DMU then.

> Additionally, this is better aligned with the way we handle access
> control: MDT implements access control checks completely within MDD,
> without relying on the underlying file system.
>
> [...]
>
>> * issue #2: Quota accuracy
>>
>> When a slave runs out of its local quota, it sends an acquire request to the
>> quota master. As I said earlier, the quota master is the only one having a
>> global overview of what has been granted to slaves. If the master can satisfy
>> the request, it grants a qunit (can be a number of blocks or inodes) to the
>> slave. The problem is that an OST can return "quota exceeded" (=EDQUOT) whereas
>> another OST still has quota. There is currently no callback to claim
>> back the quota space that has been granted to a slave.

Hmm - the slave should release quota. Arguing about the last qunit of quota is pointless; we are NOT interested in quotas that are absurdly small, I think - they would interfere with performance, and the whole architecture of quota uses qunits to leave performance unaffected.

> What strikes me in this description is how this is similar to DLM. It
> almost looks like quota can be easily implemented as a special type of
> lock, and the DLM conflict resolution mechanism with cancellation ASTs can
> be used to reclaim quota.

The slave should release. That doesn't address the issue of all OSTs consistently reporting EDQUOT. However, doing that in a persistent way may have its own troubles, namely how it would be released, and recovery after a power-off on servers. If not persistent, it would be pointless, because after a server reboot, OSSs with space left would still give the wrong answer.

Peter

>> [...]
>>
>> Cheers,
>> Johann
>
> Nikita.
Hello Peter,

On Tue, May 27, 2008 at 07:28:10AM +0800, Peter Braam wrote:
> >> When a slave runs out of its local quota, it sends an acquire request to the
> >> quota master. As I said earlier, the quota master is the only one having a
> >> global overview of what has been granted to slaves. If the master can satisfy
> >> the request, it grants a qunit (can be a number of blocks or inodes) to the
> >> slave. The problem is that an OST can return "quota exceeded" (=EDQUOT) whereas
> >> another OST still has quota. There is currently no callback to claim
> >> back the quota space that has been granted to a slave.
>
> Hmm - the slave should release quota.

I don't think that the slave can make such a decision by itself, since it does not know that we are getting closer to the global quota limit. Only the master is aware of this.
Actually, the scenario I described above can no longer happen - with recent Lustre versions at least - thanks to the dynamic qunit patch, because the master broadcasts the new qunit size to all the slaves when it is shrunk.

Cheers,
Johann
On Ter, 2008-05-27 at 07:28 +0800, Peter Braam wrote:
> > Going aside, if I were designing quota from scratch right now, I
> > would implement it completely inside of Lustre. All that is needed for
> > such an implementation is a set of call-backs that the local file system
> > invokes when it allocates/frees blocks (or inodes) for a given
> > object. Lustre would use these call-backs to transactionally update
> > local quota in its own format. That would save us a lot of hassle we
> > have dealing with the changing kernel quota interfaces, uid re-mappings,
> > and subtle differences between quota implementations on different file
> > systems.
>
> ======> IMPORTANT: get in touch with Jeff Bonwick now, let's get quota
> implemented in this way in DMU then.

I think this was proposed by Alex before, but AFAIU the conclusion was that this is not possible to do with ZFS (or at least, not easy to do).

The problem is that ZFS uses delayed allocations, i.e., allocations occur long after a transaction group has been closed, and therefore we can't transactionally keep track of allocated space: by the time the callbacks were called, we would not be allowed to write to the transaction group anymore, since another 2 txgs could have been opened already.

Since this couldn't be done transactionally, if the node crashes there would be no way of knowing how many blocks had been allocated in the latest (actually, the latest 2) committed transaction groups.

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
Ricardo M. Correia writes:
> On Ter, 2008-05-27 at 07:28 +0800, Peter Braam wrote:
>
> > [...]
>
> I think this was proposed by Alex before, but AFAIU the conclusion was
> that this is not possible to do with ZFS (or at least, not easy to do).
>
> The problem is that ZFS uses delayed allocations, i.e., allocations
> occur long after a transaction group has been closed, and therefore we
> can't transactionally keep track of allocated space: by the time
> the callbacks were called, we would not be allowed to write to the transaction
> group anymore, since another 2 txgs could have been opened already.

But that problem has to be solved anyway to implement per-user quotas for ZFS, correct?

One possible solution I see is to use something like the ZIL to log operations in the context of the current transaction group. This log can be replayed during mount to update the quota file.

> Since this couldn't be done transactionally, if the node crashes there
> would be no way of knowing how many blocks had been allocated in the
> latest (actually, the latest 2) committed transaction groups.
>
> Regards,
> Ricardo

Nikita.
On Qua, 2008-05-28 at 18:54 +0400, Nikita Danilov wrote:
> But that problem has to be solved anyway to implement per-user quotas
> for ZFS, correct?

Indeed, but it's probably easier and more reliable to make the DMU itself update an internal quota/space accounting DMU object when a txg is syncing (updating internal objects during txg sync is something that the DMU already does, e.g., for spacemaps) than to allow arbitrary modifications to a transaction group after it has been closed.

> One possible solution I see is to use something like the ZIL to log
> operations in the context of the current transaction group. This log can be
> replayed during mount to update the quota file.

Hmm.. I'm not sure it would be easy to figure out during replay how many blocks were freed, especially considering things like snapshots, clones and deferred frees (if frees are making a txg sync take too long to converge, the DMU will add them to a freelist object instead of freeing them immediately).

I agree that quotas could be implemented in Lustre (independent of the backend filesystem), but IMHO it would make more sense for the space accounting to be done in the DMU itself, due to the complexities associated with its internal behaviour.

Regards,
Ricardo
Peter Braam writes:
> Why do they not need file quota - what if someone wants to control file
> count?

The idea is that support for inode quota is not necessary in the DMU, as the upper layer can implement it by itself. Compare this with block quotas, which do require support from the DMU, because the upper layer cannot generally tell an overwrite from a new write.

Nikita.
Ricardo M. Correia writes:
> Indeed, but it's probably easier and more reliable to make the DMU
> itself update an internal quota/space accounting DMU object when a txg
> is syncing (updating internal objects during txg sync is something that
> the DMU already does, e.g., for spacemaps) than to allow arbitrary
> modifications to a transaction group after it has been closed.

Even doing it internally looks rather involved. The problem, as I understand it, is that no new block can be allocated while a transaction group is in the sync state (?), so the DMU would have to track all users and groups whose quota is affected by the current transaction group, allocate - before closing the group - some kind of on-disk table with an entry for every updated quota, and then fill these entries later when actual disk space is allocated.

Note that the DMU has to know about users and groups to implement quota internally, which looks like a pervasive interface change.

> I agree that quotas could be implemented in Lustre (independent of the
> backend filesystem), but IMHO it would make more sense for the
> space accounting to be done in the DMU itself, due to the complexities
> associated with its internal behaviour.

I absolutely agree that the DMU has to do space _accounting_ internally. The question is how to store the results of this accounting without bothering the DMU with higher-level concepts such as a user or a group identifier.

I think that the utility of the DMU as a universal back-end would improve if it were to export an interface allowing its users to update, with certain restrictions, on-disk state when a transaction group is in sync (i.e., an interface similar to the one that is internally used for spacemaps).

> Regards,
> Ricardo

Nikita.
On Qua, 2008-05-28 at 20:22 +0400, Nikita Danilov wrote:
> Even doing it internally looks rather involved. The problem, as I
> understand it, is that no new block can be allocated while a transaction
> group is in the sync state (?)

I'm not sure if you are describing it incorrectly or just using the same terms for different concepts, but in any case, blocks are allocated *while* the transaction group is syncing, and due to compression and online pool configuration changes it is impossible to know the exact on-disk space a given block will use until the transaction group is actually syncing.

> so the DMU would have to track all users and
> groups whose quota is affected by the current transaction group, allocate -
> before closing the group - some kind of on-disk table with an
> entry for every updated quota, and then fill these entries later when
> actual disk space is allocated.

Yes, that sounds correct.

> Note that the DMU has to know about users and groups to implement quota
> internally, which looks like a pervasive interface change.

No, AFAIK the consensus we reached with the ZFS team is that, since the DMU does not have any concept of users or groups, it will track space usage associated with opaque identifiers, so that when we write to a file we would give it 2 identifiers which, for us, would map one to a user and the other to a group.

> I absolutely agree that the DMU has to do space _accounting_ internally. The
> question is how to store the results of this accounting without bothering
> the DMU with higher-level concepts such as a user or a group identifier.

I really don't think we should allow the consumer to write to a txg which is already in the syncing phase; I think the DMU should store the accounting itself.

> I think that the utility of the DMU as a universal back-end would improve if it
> were to export an interface allowing its users to update, with certain
> restrictions, on-disk state when a transaction group is in sync (i.e., an
> interface similar to the one that is internally used for spacemaps).

Hmm.. I'm not sure that would be very useful - why not write the data when the txg was open in the first place? Maybe you can give a better example?

For things that require knowledge of DMU internals (like space accounting, spacemaps, ...) it shouldn't be the DMU consumer that has to write during the txg sync phase, it should be the DMU, because only the DMU should know about its internals.

The example you have given (spacemaps) is the worst of all, because spacemap updates are rather involved. Due to COW and to the ZIO pipeline design, spacemap modifications lead to a chicken-and-egg problem with transactional updates:

When you modify a space map, you create a ZIO which just before writing leads to an allocation (due to COW). But since you need to do an allocation, you need to change the spacemap again, which leads to another allocation (and also frees the old just-written block), so you need to update the space map again, and so on and on.. (!)

This is why txgs need to converge and why after a few phases the DMU gives up freeing blocks and starts re-using blocks which were freed in the same txg.

Cheers,
Ricardo
Ricardo M. Correia writes:
> On Qua, 2008-05-28 at 20:22 +0400, Nikita Danilov wrote:

[...]

> I'm not sure if you are describing it incorrectly or just using the same
> terms for different concepts, but in any case, blocks are allocated
> *while* the transaction group is syncing, and due to compression and
> online pool configuration changes it is impossible to know the exact
> on-disk space a given block will use until the transaction group is
> actually syncing.

I meant that the table mentioned below cannot grow while the transaction group is syncing, which means that the dmu has to calculate the size of the table in advance.

[...]

> > Note that the DMU has to know about users and groups to implement quota
> > internally, which looks like a pervasive interface change.
>
> No, AFAIK the consensus we reached with the ZFS team is that, since
> the DMU does not have any concept of users or groups, it will track
> space usage associated with opaque identifiers, so that when we write to
> a file we would give it 2 identifiers which, for us, would map one
> to a user and the other to a group.

Well... that's just renaming uid and gid into opaqueid0 and opaqueid1. :-)

So on one hand we have to add a couple of parameters to all dmu entry points that can allocate disk space. On the other hand we have something like

    typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);

    void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);

with the dmu calling the registered call-back when blocks are actually allocated to the object. The advantage of the latter interface is that the dmu implements only the mechanism, and the policy ("user quotas" and "group quotas") is left to the upper layers to implement.

[...]

> I really don't think we should allow the consumer to write to a txg
> which is already in the syncing phase; I think the DMU should store the
> accounting itself.

One important aspect of Lustre quota requirements that wasn't mentioned so far is that Lustre needs something more than -EDQUOT from the file system. For example, to integrate quotas with dirty cache grants, the server has to know how much quota is left; to redistribute quota across OSTs it has to modify quota, etc. If quota management and storage are completely encapsulated within the dmu, then the dmu has to provide a full quota control interface too, and that interface has to be exported from osd upward. For one thing, implementation of this interface is going to take a lot of time.

[...]

> For things that require knowledge of DMU internals (like space
> accounting, spacemaps, ...) it shouldn't be the DMU consumer that has to
> write during the txg sync phase, it should be the DMU, because only the
> DMU should know about its internals.

I don't quite understand this argument. The DMU already has an interface to capture a buffer into a transaction and to modify it within this transaction. An interface to modify a buffer after the transaction was closed, but before it is committed, is no more "internal" than the first one. It just places more restrictions on what the consumer is allowed to do with the buffer.

> When you modify a space map, you create a ZIO which just before writing
> leads to an allocation (due to COW). But since you need to do an
> allocation, you need to change the spacemap again, which leads to
> another allocation (and also frees the old just-written block),
> so you need to update the space map again, and so on and on.. (!)
>
> This is why txgs need to converge and why after a few phases the DMU gives up
> freeing blocks and starts re-using blocks which were freed in the same txg.

Good heavens. :-)

> Cheers,
> Ricardo

Nikita.
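To make the shape of this call-back proposal concrete, here is a minimal consumer-side sketch. It assumes the dmu_alloc_callback_register() interface proposed above (it does not exist in the DMU today), and the quota_lookup_by_object()/quota_mark_dirty() helpers are likewise hypothetical placeholders for whatever structure the upper layer keeps:

    /* Sketch only: dmu_alloc_callback_register() is the interface proposed
     * above, not an existing DMU call; the quota_* helpers are hypothetical.
     * objset_t is assumed to come from the usual DMU headers. */
    typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);
    void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);

    struct quota_entry {
            uint64_t owner_id;   /* opaque id the upper layer maps to a uid/gid */
            int64_t  bytes_used; /* running total maintained by the upper layer */
    };

    /* Invoked by the DMU when blocks are actually allocated to (or freed
     * from) an object; the upper layer applies the delta to its own
     * transactionally updated quota file. */
    static void quota_space_cb(objset_t *os, uint64_t objid, long bytes)
    {
            struct quota_entry *qe = quota_lookup_by_object(os, objid);

            qe->bytes_used += bytes;
            quota_mark_dirty(qe);
    }

    static void quota_attach(objset_t *os)
    {
            dmu_alloc_callback_register(os, quota_space_cb);
    }

With such an interface the DMU only reports allocation events; mapping objects to owners and enforcing limits stays entirely in the upper layer, which is the mechanism/policy split argued for above.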
On Qui, 2008-05-29 at 00:06 +0400, Nikita Danilov wrote:
> Well... that's just renaming uid and gid into opaqueid0 and
> opaqueid1. :-)

It may be simple for POSIX uids/gids, but maybe not for CIFS user ids (though since there is a mapping table it may also be simple, I'm not sure).
I think it would make sense to give it a list of opaque ids, instead of just opaqueid0 and opaqueid1 (maybe a file could belong to multiple groups, or maybe we can think of other creative ideas in the future..).

> So on one hand we have to add a couple of parameters to all dmu entry
> points that can allocate disk space. On the other hand we have something
> like
>
>     typedef void (*dmu_alloc_callback_t)(objset_t *os, uint64_t objid, long bytes);
>
>     void dmu_alloc_callback_register(objset_t *os, dmu_alloc_callback_t cb);
>
> with the dmu calling the registered call-back when blocks are actually allocated
> to the object. The advantage of the latter interface is that the dmu implements
> only the mechanism, and the policy ("user quotas" and "group quotas") is left to
> the upper layers to implement.

I don't see why that would be an advantage over what we had planned to do.
The plan we discussed with the ZFS team was to make the DMU do space accounting internally by opaque ids, so the quota policy/enforcement would still be left to the upper layers to implement.

> If quota management and storage are
> completely encapsulated within the dmu, then the dmu has to provide a full quota
> control interface too, and that interface has to be exported from osd
> upward. For one thing, implementation of this interface is going to take
> a lot of time.

Again, the plan was for the DMU to do only space accounting; the actual quota management and enforcement would be implemented in Lustre.

> I don't quite understand this argument. The DMU already has an interface to
> capture a buffer into a transaction and to modify it within this
> transaction. An interface to modify a buffer after the transaction was
> closed, but before it is committed, is no more "internal" than the first
> one. It just places more restrictions on what the consumer is allowed to
> do with the buffer.

What I mean is that IMO a consumer of a filesystem shouldn't have to know intimate details of how the filesystem (in this case, the DMU) works.
For instance, so far Lustre had no idea that transactions are actually grouped into transaction groups, and it had no idea about transaction group states.

Allowing modification of buffers by an upper layer when a transaction group is already syncing is not a very elegant way to solve this IMHO (compared with our previous plan).. :-)

Cheers,
Ricardo
Ricardo M. Correia writes:

[...]

> The plan we discussed with the ZFS team was to make the DMU do space
> accounting internally by opaque ids, so the quota policy/enforcement
> would still be left to the upper layers to implement.

Hm.. it seems I am confused. If ZFS stores the quota table internally, how is this table made available to the upper layers to implement policy?

Nikita.
On Qui, 2008-05-29 at 01:11 +0400, Nikita Danilov wrote:
> Hm.. it seems I am confused. If ZFS stores the quota table internally, how is
> this table made available to the upper layers to implement policy?

uint64_t dmu_get_space_usage(objset, opaque_id)?

--
Ricardo
Ricardo M. Correia writes:
> uint64_t dmu_get_space_usage(objset, opaque_id)?

It is not immediately clear how chown and chgrp can be implemented on top of this. Plus, an interface to iterate over this table is most likely required for quota tools.

To return to the question of why using call-backs notifying about object space allocation changes is preferable to maintaining space allocation per-id in the file system: opaque ids have to be visible not only in the dmu interface, but also in the osd interface, because quota policy is implemented above osd. But

- osd knows nothing about users and groups (it uses capability-based access control), and its interface would have to be expanded to pass through identifiers that it doesn't understand and doesn't use. This looks wrong. osd operates on objects, identified by fids, and it would be much more natural to do space accounting at the object granularity.

- the osd interface is identical on all platforms, so, in effect, the zfs space accounting interface is enforced on ldiskfs (and on all possible future back-ends).

Nikita.
Regardless of how file quota has to be implemented, we do need file quota, just to be clear.

Peter

On 5/28/08 11:24 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Peter Braam writes:
>> Why do they not need file quota - what if someone wants to control file
>> count?
>
> The idea is that support for inode quota is not necessary in the DMU, as the
> upper layer can implement it by itself. Compare this with block quotas,
> which do require support from the DMU, because the upper layer cannot
> generally tell an overwrite from a new write.
>
> Nikita.
Hi Nikita,

On Sex, 2008-05-30 at 20:38 +0400, Nikita Danilov wrote:
> What about the following:
>
> - dmu tracks per-object 'space usage', in addition to the usual block
>   count as reported by st_blocks.

Currently, the space reported by st_blocks is calculated from dnode_phys_t->dn_used, which in recent SPA versions tracks the number of allocated bytes (not blocks) of a DMU object, which is accurate up to the last committed txg.
Is this what you mean by "space usage"?

> - when space is actually allocated during transaction sync, dmu
>   notifies its user about changes in space usage by invoking some
>
>       void (*space_usage)(objset_t *os, __u64 objid, __s64 delta);
>
>   call-back, registered by the user.

Ok.

> - user updates its data-structures in the context of the currently
>   open transaction.

Ok.

> - dmu internally updates space usage information in the context of
>   the transaction being synced.

This is being done per-object already.

> - it also records a list (let's call this the "pending list") of all
>   objects whose space allocation changed in the context of the same
>   transaction.

Ok, this is where I am starting not to like it.. :)

> - after a mount, dmu calls ->space_usage() against all objects in
>   the pending lists of the last committed transaction group, to update
>   the client's data-structures that are possibly stale due to the loss
>   of the next transaction group.

What do you mean by mount? Do you mean when starting an OST?

> Do you think that might work?

If I understood correctly, the pending list you propose sounds like a recovery mechanism (similar to a log), which I don't think is the right way to implement this.

First of all, I think you would need to keep track of objects changed in the last 2 synced transaction groups, not just the last one. The reason is that when the DMU is syncing transaction group N, it is likely that you can only be writing to transaction group N+2, because transaction group N+1 may already be quiescing. This presents a challenge because if the machine crashes, you may lose data in 2 transaction groups, not just 1, which I think would make things harder to recover..

Another problem is this: let's say the DMU is syncing a transaction group, and starts calling ->space_usage() for objects. Now the machine crashes, and comes up again.
Now how do you distinguish which objects had ->space_usage() called in the transaction group that was syncing and which didn't (or how would you differentiate between ->space_usage() calls of txg N and those of txg N+1)?
At a minimum, you would need a txg parameter in ->space_usage(), which again leaks a bit of internal knowledge of how the DMU works outside the DMU (and which we may not assume will always work the same way in future versions).

Another thing that comes to mind is that the pending list is something very problem-specific that would only be useful for Lustre, not other consumers, so the ZFS team may object to this.. For example, for implementing uid/gid quotas in ZFS, there is no need for such a mechanism..

And furthermore, I think this kind of recovery could be better implemented using commit callbacks, which is an abstraction already designed for recovery purposes and which is backend-agnostic.

Ok, now stepping outside of the pending list (whose purpose I may not have understood correctly at all :-), I think implementing quotas in ZFS is harder than it may look at first sight. For example, let's say you have 1 MB of quota left. How do you determine how much data you can write before the quota runs out?
This may shock you, but depending on the pool configuration, filesystem properties and object block size, writing 1 MB of file data can take anywhere from exactly 0 bytes to 9.25 MB of allocated space (!!).

Now let's scale this up and imagine you have 1 GB of quota left, and you write 1 GB of data (and you do this sufficiently fast). In the worst-case scenario, you could end up going 8.25 GB over the limit, which goes against any possible wish of having fine-grained quotas.. :-)

BTW, this reminds me that I am almost sure our uOSS grants code is wrong (I have not been assigned as an inspector, so I can't say how bad it is..).

Perhaps I am concentrating too much on correctness.. maybe going over a quota is not too big of a deal; I remember some conversations between Andreas and the ZFS team which implied that not having 100% correctness is not too big of a problem. However, I am not so sure about grants.. :/

Regards,
Ricardo
On Sáb, 2008-05-31 at 16:31 +0100, Ricardo M. Correia wrote:
> This may shock you, but depending on the pool configuration,
> filesystem properties and object block size, writing 1 MB of file data
> can take anywhere from exactly 0 bytes to 9.25 MB of allocated space (!!).

Let me correct that: in the very worst-case scenario, writing 1 MB of file data can actually consume (a bit more than) 11.25 MB of allocated space..

--
Ricardo
Ricardo M. Correia writes:
> Hi Nikita,

Hello,

(I reordered some of the comments below.)

> Currently, the space reported by st_blocks is calculated from
> dnode_phys_t->dn_used, which in recent SPA versions tracks the number of
> allocated bytes (not blocks) of a DMU object, which is accurate up to
> the last committed txg.
> Is this what you mean by "space usage"?

I meant a counter of the bytes or blocks that this object occupies for quota purposes. I specifically don't want to identify 'space usage' with st_blocks, because for modern file systems there is no single way to define what to include in quota: users want quota to be consistent with both df(1) and du(1), and in the presence of features like snapshots this is not generally possible.

> What do you mean by mount? Do you mean when starting an OST?

Yes, OST or MDT.

> First of all, I think you would need to keep track of objects changed in
> the last 2 synced transaction groups, not just the last one. The reason

Indeed, I omitted this for the sake of clarity.

> group N+1 may already be quiescing. This presents a challenge because if
> the machine crashes, you may lose data in 2 transaction groups, not
> just 1, which I think would make things harder to recover..

Won't it be enough to record in the pending list objects from the two last transaction groups, if necessary?

> Another problem is this: let's say the DMU is syncing a transaction
> group, and starts calling ->space_usage() for objects. Now the machine
> crashes, and comes up again.
> Now how do you distinguish which objects had ->space_usage() called
> in the transaction group that was syncing and which didn't
> (or how would you differentiate between ->space_usage() calls of

But we don't have to, if we make ->space_usage() idempotent, i.e., taking an absolute space usage as the last argument, rather than a delta. In that case, the DMU is free to call it multiple times, and the client has to cope with this. (Hmm... I am pretty sure this is what I was thinking about when composing the previous message, but a confusing signed __s64 delta somehow got in, sorry.)

> > - dmu internally updates space usage information in the context of
> >   the transaction being synced.
>
> This is being done per-object already.

Aha, this simplifies the whole story significantly. If the dmu already maintains for every object a space usage counter that is suitable for quota, then the 'pending list' can be maintained by the dmu client, without any additional support from the dmu:

- when (as part of an open transaction) the client does an operation that can potentially modify space usage, it adds the object identifier to the pending list, implemented as a normal dmu object;

- when disk space is actually allocated (the transaction group is in sync mode), the client gets the ->space_usage() call-back as above;

- on a 'mount', the client scans the pending list object, fetches space usage from the dmu, updates the client's internal data-structures, and prunes the pending list.

Of course, again, the pending log has to keep track of objects modified in the last 2 transaction groups. With the help of the commit call-back, even ->space_usage() seems unnecessary, because at commit time the client can scan the pending list (in memory). Heh, it seems that the quota can be implemented completely outside of the dmu.

> And furthermore, I think this kind of recovery could be better
> implemented using commit callbacks, which is an abstraction already
> designed for recovery purposes and which is backend-agnostic.

Sounds interesting, can you elaborate on this?

> Perhaps I am concentrating too much on correctness.. maybe going
> over a quota is not too big of a deal; I remember some conversations
> between Andreas and the ZFS team which implied that not having 100%
> correctness is not too big of a problem. However, I am not so sure
> about grants.. :/

It's my impression too that the agreement was to sacrifice some degree of correctness to simplify the implementation.

> Regards,
> Ricardo

Nikita.
On Sáb, 2008-05-31 at 20:19 +0400, Nikita Danilov wrote:
> I meant a counter of the bytes or blocks that this object occupies for
> quota purposes. I specifically don't want to identify 'space usage' with
> st_blocks, because for modern file systems there is no single way to
> define what to include in quota: users want quota to be consistent
> with both df(1) and du(1), and in the presence of features like snapshots
> this is not generally possible.

I think dnode_phys_t->dn_used can be used for this, because AFAICS it keeps track of allocated space referenced by the active filesystem (in other words, it does not include space which is referenced only by snapshots).
I am assuming snapshots should not have any effect on quotas, right?

> Won't it be enough to record in the pending list objects from the two last
> transaction groups, if necessary?

Hmm.. I think so, but I think we should not rely on this always being 2. I think we should allow the list to have an unbounded size, and let the commit callbacks notify us when an entry can be pruned from the list.

> But we don't have to, if we make ->space_usage() idempotent, i.e., taking
> an absolute space usage as the last argument, rather than a delta. In
> that case, the DMU is free to call it multiple times, and the client has
> to cope with this. (Hmm... I am pretty sure this is what I was thinking
> about when composing the previous message, but a confusing signed __s64
> delta somehow got in, sorry.)

Ok, that clears my previous concern.
But in this case, how do you know how much you need to add to or subtract from a quota when an object changes size?
I am guessing that you'd need to at least write the previous object size as part of the pending list, so that when you're recovering you'd know the delta..

Heh, to me this whole thing sounds quite complicated to get right, but I think it could work (but of course, fine-grained hard quotas are another matter altogether..)..

> And furthermore, I think this kind of recovery could be better
> implemented using commit callbacks, which is an abstraction already
> designed for recovery purposes and which is backend-agnostic.
>
> Sounds interesting, can you elaborate on this?

I was thinking that commit callbacks could be used by the DMU consumer to solve this problem instead of having an internal DMU list/log, and I guess you've already figured out how that could be done :)

Cheers,
Ricardo
Ricardo M. Correia writes:
> I am assuming snapshots should not have any effect on quotas, right?

That's the problem: on the one hand, users want to use quota to control actual disk usage, e.g., to not run out of disk space. For this, one wants to include snapshots in quota. On the other hand, space in snapshots is not easily reclaimable, which is not very convenient, because the user might be left without a way to clean up space. Note that this situation is possible without snapshots too: another user can hard-link your file from a directory to which you have no access.

> Hmm.. I think so, but I think we should not rely on this always being 2.
> I think we should allow the list to have an unbounded size, and let the
> commit callbacks notify us when an entry can be pruned from the list.

Right.

> > [...]
>
> Ok, that clears my previous concern.
> But in this case, how do you know how much you need to add to or subtract
> from a quota when an object changes size?
> I am guessing that you'd need to at least write the previous object size
> as part of the pending list, so that when you're recovering you'd know
> the delta..

Yes, something similar to what databases do for redo logging, where the same log entry can be replayed multiple times.

> Cheers,
> Ricardo

Nikita.
Jeff -

Could you get in touch with Nikita and Ricardo and assist them with a draft of a quota design for the DMU? Nikita has some interesting API proposals, but there are some pretty deep ZFS issues involved where help would be welcome, as far as I can see.

Just as a heads-up, quota in systems like Lustre is quite a difficult issue, as many servers contribute to quota usage and this needs "acquire" and "release" of quota in reasonable chunks to avoid the server-server protocol getting too chatty.

Thank you for your help!

Peter

On 5/28/08 10:54 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Ricardo M. Correia writes:
>
> [...]
>
> But that problem has to be solved anyway to implement per-user quotas
> for ZFS, correct?
>
> One possible solution I see is to use something like the ZIL to log
> operations in the context of the current transaction group. This log can be
> replayed during mount to update the quota file.
>
> [...]
>
> Nikita.
I am quite worried about the dynamic qunit patch. I am not convinced I want smaller qunits to stick around.

Please PROVE RIGOROUSLY that qunits grow large again quickly, otherwise they create too much server-server overhead. The cost of 100MB of disk space is barely more than a cent now; what are we trying to address with tiny qunits?

Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot handle more than 15,000 RPCs / sec.

No arguing or opinions here - numbers, please.

The original design I did 4 years ago limited quota calls from one OSS to the master to one per second. Qunits were made adaptive without solid reasoning or design.

Peter

On 5/28/08 4:06 PM, "Johann Lombardi" <johann at sun.com> wrote:

> Hello Peter,
>
> On Tue, May 27, 2008 at 07:28:10AM +0800, Peter Braam wrote:
>
>> [...]
>
> I don't think that the slave can make such a decision by itself, since it does
> not know that we are getting closer to the global quota limit. Only the master
> is aware of this.
> Actually, the scenario I described above can no longer happen - with recent
> Lustre versions at least - thanks to the dynamic qunit patch, because the
> master broadcasts the new qunit size to all the slaves when it is shrunk.
>
> Cheers,
> Johann
For CIFS quota and user ids, get in touch with Matt Wu and Mike Shapiro. There are special interfaces in ZFS/DMU for better Windows support that we probably need to leverage.

Peter

On 5/29/08 5:07 AM, "Ricardo M. Correia" <Ricardo.M.Correia at Sun.COM> wrote:

> [...]
That the total number of quota RPCs made by all OSSs (up to 5000 of them) should not exceed something like 1/3 of this, based on today's hardware, otherwise quota traffic will slow the cluster down. We have seen big servers do 15,000 updates/sec sustained, so that is a measure we can use at the moment.

Peter

On 6/1/08 10:41 AM, "Robert Gordon" <Robert.Gordon at Sun.COM> wrote:

> On May 31, 2008, at 9:32 PM, Peter Braam wrote:
>> A server cannot handle more than 15,000 RPCs / sec.
>
> I'm curious (always ;) ) ..
>
> What do you exactly mean by this statement?
>
> Thanks,
> Robert..
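As a back-of-the-envelope illustration of the figures quoted in this thread (the numbers are the ones above; the 1/3 budget split is only the rule of thumb just stated, not a measured limit):

    /* Back-of-the-envelope only, using the figures from this thread. */
    #include <stdio.h>

    int main(void)
    {
            const double master_rpc_rate = 15000.0;     /* sustained RPCs/sec one server can handle */
            const double quota_share     = 1.0 / 3.0;   /* rough budget for quota traffic           */
            const double num_oss         = 5000.0;      /* OSS count to plan for                    */

            double quota_budget = master_rpc_rate * quota_share; /* ~5,000 quota RPCs/sec in total */
            double per_oss      = quota_budget / num_oss;        /* ~1 quota RPC/sec per OSS       */

            printf("quota budget: %.0f RPC/s total, %.1f RPC/s per OSS\n",
                   quota_budget, per_oss);
            return 0;
    }

That lands at roughly one quota RPC per OSS per second, which is the same order as the one-call-per-second limit from the original design mentioned earlier in the thread.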
On Sun, Jun 01, 2008 at 10:36:33AM +0800, Peter Braam wrote:
> For CIFS quota and user ids, get in touch with Matt Wu and Mike Shapiro.
> There are special interfaces in ZFS/DMU for better Windows support that we
> probably need to leverage.
>
> Peter

For the design doc of how we unified POSIX and Windows SIDs in the underlying filesystem and kernel, see:

opensolaris.org/os/community/arc/caselog/2007/064/final-materials/spec-txt/

The FUID system described there is implemented in ZFS now. If you are doing work to add user accounting to ZFS/DMU (which is wonderful news), all of that should be implemented using the FUID tables, which will result in it working transparently for both POSIX and Windows IDs. Any upstack expressions of this that are meant to be persistent for customers (i.e. a report) or sent over the wire should express the accounted-for entities using the generic string form of SIDs that I describe in there.

-Mike

--
Mike Shapiro, Sun Microsystems Fishworks.            blogs.sun.com/mws/
I'd suggest working with Matt Ahrens on this.

Jeff

On Sun, Jun 01, 2008 at 10:26:41AM +0800, Peter Braam wrote:
> Jeff -
>
> Could you get in touch with Nikita and Ricardo and assist them with a draft
> of a quota design for the DMU? Nikita has some interesting API proposals, but
> there are some pretty deep ZFS issues involved where help would be welcome,
> as far as I can see.
>
> Just as a heads-up, quota in systems like Lustre is quite a difficult issue,
> as many servers contribute to quota usage and this needs "acquire" and
> "release" of quota in reasonable chunks to avoid the server-server protocol
> getting too chatty.
>
> Thank you for your help!
>
> Peter
>
> [...]
Jeff Bonwick writes:
> I'd suggest working with Matt Ahrens on this.

Hello,

we were discussing recently what is needed from the DMU to implement quotas and other forms of space accounting. Our basic premise is that it is desirable to keep the DMU part of the quota support to a minimum, and to implement only mechanism here, leaving policy to the upper layers. Questions of what exactly constitutes disk space usage for quota purposes (e.g., whether and how snapshots have to be accounted for, etc.) are orthogonal to the present interface discussion; it's assumed that dnode_phys_t::dn_used can be used for this.

The two main use cases are

- user and group quotas in ZPL, and

- distributed cluster-wide quota in Lustre and pNFS.

Surprisingly, it seems that these use cases can be implemented without any additional support from the DMU, except for the commit call-back, which is needed for other purposes too.

The general idea is that the DMU user (ZPL or the Lustre MDD module) maintains its own database mapping quota consumers (user and group identifiers) to their current space usage. A mechanism is needed to keep this database up to date. To this end, the user keeps in memory a list of all DMU objects to which space was allocated or deallocated (the "pending list"). The user can do this, provided it has full control of the DMU (which holds for both use cases above). See below on how the pending list is truncated. For each object, the pending list also records its "current space usage" (dnode_phys_t::dn_used), and there is at most one record for a given object in this list.

When a transaction group is about to be closed, the user finds all objects in the pending list belonging to this or previous transaction groups and, in the context of this transaction group, appends to a special "pending file" a record containing

    (object id, current space usage)

Next, the transaction group is synced, disk space is actually allocated to the objects, and dnode_phys_t::dn_used is modified. When the transaction group has committed, the commit call-back is invoked. In this call-back the user scans the pending list and updates its internal quota database:

    foreach (object, space_usage) in list {
            upd_quota(object->owner, dnode->dn_used - space_usage);
            upd_quota(object->group, dnode->dn_used - space_usage);
            remove_record_from_list();
    }

Updates to the user's quota database are done in the context of the currently open transaction.

When the DMU starts (possibly after a crash), the same loop as above is executed for all records in the pending file (as visible in the last committed transaction group). Note that this loop is idempotent: executing it a second time after the first execution has successfully committed has no effect.

In other words, the idea is to update the quota database on transaction group commit (so that updates to the database go into a transaction group in the "future" w.r.t. the object operations that resulted in the space allocation), and to keep in the pending file a list of objects modified during the last 2 transaction groups, so that this file can be used as a kind of redo log to update the quota database in case of failure.

> Jeff

Nikita.
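To flesh out the commit-time loop above in C, here is a rough sketch. Everything in it is illustrative: pending_entry, upd_quota(), pending_prune() and dmu_object_space_used() are assumed names standing in for whatever the DMU user actually keeps, not existing DMU or Lustre interfaces:

    /* Illustrative only; all names are assumed, not real DMU/Lustre APIs. */
    #include <stdint.h>

    struct pending_entry {
            uint64_t objid;          /* DMU object id                             */
            uint64_t owner, group;   /* opaque ids recorded by the DMU user       */
            uint64_t space_before;   /* dn_used recorded when the txg was closed  */
            struct pending_entry *next;
    };

    /* Assumed helpers provided by the DMU user and its osd layer. */
    extern uint64_t dmu_object_space_used(uint64_t objid);
    extern void upd_quota(uint64_t id, int64_t delta);
    extern void pending_prune(struct pending_entry *pe);

    /* Run from the transaction-group commit call-back. */
    static void quota_commit_cb(struct pending_entry *list)
    {
            struct pending_entry *pe;

            for (pe = list; pe != NULL; pe = pe->next) {
                    /* dn_used now reflects the space actually allocated in sync. */
                    uint64_t now   = dmu_object_space_used(pe->objid);
                    int64_t  delta = (int64_t)(now - pe->space_before);

                    /* Database updates go into the currently open transaction,
                     * i.e. strictly "after" the operations that allocated space. */
                    upd_quota(pe->owner, delta);
                    upd_quota(pe->group, delta);

                    /* Prune the matching record from the on-disk pending file in
                     * the same open transaction, keeping the redo log bounded to
                     * the last two transaction groups. */
                    pending_prune(pe);
            }
    }

The same loop, re-run over the records of the pending file after a crash, performs the redo step described above.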
On Sun, Jun 01, 2008 at 10:32:46AM +0800, Peter Braam wrote:
> I am quite worried about the dynamic qunit patch.
> I am not convinced I want smaller qunits to stick around.
>
> Please PROVE RIGOROUSLY that qunits are grow large quickly again, otherwise
> they create too much server - server overhead.

I've _not_ been involved in the design of the adaptive qunit feature (the DLD
pre-dates my involvement with Sun/CFS), but here is how it basically works:
* if remaining quota space < 4 * #osts * current_qunit, the qunit size is
  divided by 2,
* if remaining quota space > 8 * #osts * current_qunit, the qunit size is
  multiplied by 2.
The initial bunit size (also the maximum value) is the default one (i.e. 128MB).
The "4" and "8" can be tuned through /proc and there is a minimum value for
qunit (by default, 1MB = PTLRPC_MAX_BRW_SIZE for bunit).
Let's consider a cluster with 500 OSTs:
* the initial qunit size for a particular uid/gid is 128MB (unless the quota
  limit is too low)
* when left_quota = 256GB, bunit is shrunk to 64MB
* when left_quota = 128GB, bunit is shrunk to 32MB
* when left_quota = 64GB, bunit is shrunk to 16MB
* when left_quota = 32GB, bunit is shrunk to 8MB
* when left_quota = 16GB, bunit is shrunk to 4MB
* when left_quota = 8GB, bunit is shrunk to 2MB
* when left_quota = 4GB, bunit is shrunk to 1MB
Similarly, bunit is grown when the remaining quota space hits the same
thresholds (a rough sketch of this shrink/grow rule follows after this
message).
The dynamic qunit patch also maintains an accurate accounting of how many
threads are waiting for quota space from the master. Thus, slaves can ask for
more than one qunit at a time in a single DQACQ request.
IMO, the current algorithm/parameters are probably too aggressive and the
correct tuning has not been found yet.

> The cost of 100MB of disk space is barely more than a cent now; what are we
> trying to address with tiny qunits?

Today, a couple of customers are asking for accurate quotas. We should
probably discuss with them to understand their motivations.
From my point of view, the interesting feature is not to support small quota
limits or tiny qunits, but to have the ability to adapt qunits for each
uid/gid depending on how much free quota space remains. We can now increase
qunit significantly without hurting quota accuracy, and performance should
only be impacted when getting closer to the quota limit (that was the original
goal in the DLD). That being said, adaptive qunits can easily be disabled by
setting the minimum qunit size to the default qunit size.

> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to
> 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot
> handle more than 15,000 RPC's / sec.
>
> No arguing, or opinions here, numbers please.

With static qunits: 100TB/s / default_bunit_size ~ 1,000,000 RPCs / sec.
To get below the 15,000 RPCs/s, we should increase bunit to ~6.7GB. If each
OST acquires 1 qunit ahead of time w/o actually using it, we "leak"
6.7GB * 5,000 OSTs = 33.5TB.
With adaptive qunits, we can set the default bunit to a larger value (e.g.
10GB) and the minimum bunit to 100MB. This way, quotas can remain "accurate"
(maximum leak is 500GB) and performance would be impacted (more RPCs sent)
only when getting close to the quota limit. However, the current
shrink/enlarge algorithm is definitely not suitable for such a big cluster
since it decreases qunit too quickly.

> The original design I did 4 years ago limited quota calls from one OSS to
> the master to one per second.
> Qunits were made adaptive without solid reasoning or design.

IMHO, adaptive qunits are not such a bad feature, even if there is definitely
room for improvement.

Johann
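For illustration only, the shrink/grow rule described above could be written as the following C helper; the function and parameter names are invented and the real b1_6 lquota code is organized differently:

    #include <stdint.h>

    /* apply "shrink below 4 * #osts * qunit, grow above 8 * #osts * qunit",
     * clamped to the configured minimum and maximum (default) qunit sizes */
    static uint64_t qunit_recalc(uint64_t left_quota, uint64_t nr_osts,
                                 uint64_t qunit, uint64_t qunit_min,
                                 uint64_t qunit_max)
    {
            while (qunit > qunit_min && left_quota <= 4 * nr_osts * qunit)
                    qunit /= 2;

            while (qunit < qunit_max && left_quota > 8 * nr_osts * qunit)
                    qunit *= 2;

            return qunit;
    }

With these thresholds, a 128MB qunit on a 500-OST cluster starts shrinking once the remaining quota drops to 256GB, matching the table in the message above.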
On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote:> I am quite worried about the dynamic qunit patch. > I am not convinced I want smaller qunits to stick around. > > Please PROVE RIGOROUSLY that qunits are grow large quickly again, otherwise > they create too much server - server overhead. The cost of 100MB of disk > space is barely more than a cent now; what are we trying to address withtiny > qunits? > > Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to > 100TB/sec in I/O. Calculate quota RPC traffic from that. A server cannot > handle more than 15,000 RPC''s / sec. > > No arguing, or opinions here, numbers please. The original design I did 4 > years ago limited quota calls from one OSS to the master to one per second. > Qunits were made adaptive without solid reasoning or design.Just a note - it isn''t only shrinking of qunits that is possible, but also growth of qunits. I think there was also work done to allow recall of qunits from the servers, but I''m not sure if it was landed into CVS. If we are significantly re-architecting quotas, I''d suggest that we also re-implement grants at the same time and use the DLM to do both of them. This way we can have grant + quota on a per-file basis (quota + grant are given to clients as part of extent lock LVB), and are also able to recall quota + grant. We may not even want to have separate quota+grant, since we track the ownership of files on the OSTs and space allocation is done on a per-file basis. It would be possible, for example, to take a user''s whole quota from the master, split it evenly into "num_osts * 2" chunks at mount time to pass to the OSTs, they further grant it to clients when they request extent locks, and then avoid ALL master->OST quota RPCs unless that user actually got close to exceeding their quota, either granting out some of the remaining "num_osts" qunits or recalling some of the outstanding quota (possibly via lock "conversion" to avoid revoking the quota lock). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
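A rough sketch, under the assumptions in Andreas' mail, of what the mount-time split might look like; ost_grant_quota() and master_reserve() are invented names standing in for whatever DLM-based grant mechanism would actually carry the chunks:

    #include <stdint.h>

    extern void ost_grant_quota(unsigned int ost_idx, uint64_t bytes); /* hypothetical */
    extern void master_reserve(uint64_t bytes);                        /* hypothetical */

    /* split a user's whole quota into num_osts * 2 chunks: one chunk per OST
     * up front, the other half kept by the master for later grants/recalls */
    static void quota_initial_split(uint64_t user_limit, unsigned int num_osts)
    {
            uint64_t chunk = user_limit / ((uint64_t)num_osts * 2);
            unsigned int i;

            for (i = 0; i < num_osts; i++)
                    ost_grant_quota(i, chunk);

            /* the remaining ~num_osts chunks stay with the master and are
             * handed out or recalled only when the user nears the limit */
            master_reserve(user_limit - (uint64_t)num_osts * chunk);
    }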
Nikita Danilov wrote:> Jeff Bonwick writes: > > I''d suggest working with Matt Ahrens on this. > > Hello, > > we were discussing recently what is needed from the DMU to implement quotas > and other forms of space accounting. Our basic premise is that it is desirable > to keep DMU part of the quota support at minimum, and to implement only > mechanism here, leaving policy to the upper layers.I agree with this premise. However, your proposed implementation (especially the asynchronous update mechanism and associated pending file) seems unnecessarily complicated. I would suggest that we simply update a "database" (eg. ZAP object or sparse array) of userid -> space usage from syncing context when the space is allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the problems this presents[*] will be more tractable than the method you outlined. --matt [*] eg, if the DB object is stored in the user''s objset, then updating it in syncing context may be problematic. if it is stored in the MOS, carrying it along when doing snapshot operations will be painful (snapshot, clone, send, recv, rollback, etc).
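As a sketch of the direction Matt suggests (not actual ZFS code; the table and helper below are made up for illustration), the per-uid deltas could be accumulated while blocks are born/killed in syncing context and folded into a uid -> space-used object at the end of the sync:

    #include <stdint.h>

    struct usage_delta {
            uint64_t ud_id;      /* uid or gid */
            int64_t  ud_delta;   /* bytes born (+) or killed (-) in this txg */
    };

    /* called next to dsl_dataset_block_born()/_kill() in syncing context */
    static void usage_note_block(struct usage_delta *tab, int *nr,
                                 uint64_t id, int64_t bytes)
    {
            int i;

            for (i = 0; i < *nr; i++) {
                    if (tab[i].ud_id == id) {
                            tab[i].ud_delta += bytes;
                            return;
                    }
            }
            tab[*nr].ud_id = id;
            tab[*nr].ud_delta = bytes;
            (*nr)++;
    }

    /* at the end of the sync the table would be folded, still in syncing
     * context, into the ZAP (or sparse array) keyed by uid/gid */

The open questions in Matt's footnote — where that object lives and how it follows snapshot operations — are untouched by such a sketch.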
Matthew Ahrens writes: > Nikita Danilov wrote: > > Jeff Bonwick writes: > > > I''d suggest working with Matt Ahrens on this. > > > > Hello, > > > > we were discussing recently what is needed from the DMU to implement quotas > > and other forms of space accounting. Our basic premise is that it is desirable > > to keep DMU part of the quota support at minimum, and to implement only > > mechanism here, leaving policy to the upper layers. > > I agree with this premise. However, your proposed implementation (especially > the asynchronous update mechanism and associated pending file) seems > unnecessarily complicated. > > I would suggest that we simply update a "database" (eg. ZAP object or sparse > array) of userid -> space usage from syncing context when the space is > allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the > problems this presents[*] will be more tractable than the method you outlined. Indeed, this solution is much simpler, and it was considered initially. I see following drawbacks in it: - a notion of a user identifier (or some opaque identifier) has to be introduced in DMU interface. DMU doesn''t interpret these identifiers in any way, except for using them as keys in a space usage database. A set of these identifiers has to be passed to every DMU entry point that might result in space allocation (a set is needed because there are group quotas, and to keep interface more or less generic). - an implementation of chown, chgrp, and distributed quota require DMU user to modify this database. Also, an interface to iterate over this database is most likely needed for things like distributed fsck, and user level quote reporting tools. It seems that it would be quite difficult to encapsulate such a database within DMU. > > --matt > > [*] eg, if the DB object is stored in the user''s objset, then updating it in > syncing context may be problematic. if it is stored in the MOS, carrying it The proposal was to update the database in the context of currently open transaction group. That is, when transaction group T has just committed, commit call-back is invoked and the database is updated in the context of some transaction belonging to transaction group T + 2 (T + 1 being in sync). It is because of this that pending file has to keep track of objects from two last transaction groups. > along when doing snapshot operations will be painful (snapshot, clone, send, > recv, rollback, etc). Nikita.
>-----Original Message----- >From: lustre-devel-bounces at lists.lustre.org >[mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas Dilger >Sent: Tuesday, June 03, 2008 7:25 AM >To: Peter Braam >Cc: Bryon Neitzel; Johann Lombardi; Peter Bojanic; Jessica A. Johnson; Eric >Barton; Nikita Danilov; lustre-devel at lists.lustre.org >Subject: Re: [Lustre-devel] Moving forward on Quotas > >On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote: >> I am quite worried about the dynamic qunit patch. >> I am not convinced I want smaller qunits to stick around. >> >> Please PROVE RIGOROUSLY that qunits are grow large quickly again, >otherwise >> they create too much server - server overhead. The cost of 100MB of disk >> space is barely more than a cent now; what are we trying to addresswithtiny>> qunits? >> >> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to >> 100TB/sec in I/O. Calculate quota RPC traffic from that. A servercannot>> handle more than 15,000 RPC''s / sec. >> >> No arguing, or opinions here, numbers please. The original design I did4>> years ago limited quota calls from one OSS to the master to one persecond.>> Qunits were made adaptive without solid reasoning or design. > >Just a note - it isn''t only shrinking of qunits that is possible, but also >growth of qunits. I think there was also work done to allow recall of >qunits from the servers, but I''m not sure if it was landed into CVS.Yes, it has. In order to prevent ping-pong effect, if qunit is reduced, qunit _only_ could be Increased after the_latest_qunit_reduction + lqc_switch_seconds(default is 300 secs) . At designing, we think accuracy is more urgent(otherwise, users will see earlier -EDQUOT), so decreasing can be done any time, but increasing has this limitation. tianzy
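A toy C rendering of the rate limit tianzy describes; only the 300-second default comes from the message above, while the structure and field names are invented:

    #include <time.h>

    struct qunit_ctl {
            unsigned long long qc_qunit;       /* current qunit size */
            time_t             qc_last_shrink; /* when qunit was last reduced */
            int                qc_switch_sec;  /* cool-down, default 300s */
    };

    /* shrinking is allowed at any time; growing is allowed only once the
     * cool-down after the latest reduction has expired, to avoid ping-pong */
    static int qunit_may_grow(const struct qunit_ctl *qc)
    {
            return time(NULL) >= qc->qc_last_shrink + qc->qc_switch_sec;
    }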
Here is some more guidance for thinking about the Lustre quota design: Adaptive qunits are great, but all I see is kind of a hack attempting to get this right instead of a good design. Here are some use cases you need to address, and hopefully address with existing infrastructure. (A) You need callbacks to change it, so that when it shrinks clients can give up quota. (B) mechanisms to recover the correct value if a client reconnects, or master reboots. Starting from a hard coded default value is wrong. If it''s global, then you''d need to store this in the configuration log so that it can be re-read and managed when it changes, using the config log. If it is a per user qunit then we may need an entirely new, similar mechanism. It probably is, and this is what worries me - it''s a huge amount of work to get this right. Doing this is a LOT of work, and unless you do it right the implementation will see a similar pattern of problems with customers as the previous one. So I want to continue to challenge you by asking if there isn''t a quota solution that doesn''t require adaptive behavior, at the expense of small amounts of unmanaged space. Peter On 6/3/08 5:49 PM, "Landen tian" <Zhiyong.Tian at Sun.COM> wrote:>> -----Original Message----- >> From: lustre-devel-bounces at lists.lustre.org >> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas Dilger >> Sent: Tuesday, June 03, 2008 7:25 AM >> To: Peter Braam >> Cc: Bryon Neitzel; Johann Lombardi; Peter Bojanic; Jessica A. Johnson; Eric >> Barton; Nikita Danilov; lustre-devel at lists.lustre.org >> Subject: Re: [Lustre-devel] Moving forward on Quotas >> >> On Jun 01, 2008 10:32 +0800, Peter J. Braam wrote: >>> I am quite worried about the dynamic qunit patch. >>> I am not convinced I want smaller qunits to stick around. >>> >>> Please PROVE RIGOROUSLY that qunits are grow large quickly again, >> otherwise >>> they create too much server - server overhead. The cost of 100MB of disk >>> space is barely more than a cent now; what are we trying to address > withtiny >>> qunits? >>> >>> Plan for 5000 OSS servers at the minimum and 1,000,000 clients, and up to >>> 100TB/sec in I/O. Calculate quota RPC traffic from that. A server > cannot >>> handle more than 15,000 RPC''s / sec. >>> >>> No arguing, or opinions here, numbers please. The original design I did > 4 >>> years ago limited quota calls from one OSS to the master to one per > second. >>> Qunits were made adaptive without solid reasoning or design. >> >> Just a note - it isn''t only shrinking of qunits that is possible, but also >> growth of qunits. I think there was also work done to allow recall of >> qunits from the servers, but I''m not sure if it was landed into CVS. > > Yes, it has. In order to prevent ping-pong effect, if qunit is reduced, > qunit _only_ could be > Increased after the_latest_qunit_reduction + lqc_switch_seconds(default is > 300 secs) . > At designing, we think accuracy is more urgent(otherwise, users will see > earlier -EDQUOT), > so decreasing can be done any time, but increasing has this limitation. > > tianzy > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
>-----Original Message-----
>From: lustre-devel-bounces at lists.lustre.org
>[mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Peter Braam
>Sent: Wednesday, June 04, 2008 9:24 AM
>To: Landen tian; 'Andreas Dilger'
>Cc: 'Bryon Neitzel'; 'Johann Lombardi'; 'Peter Bojanic'; 'Jessica A. Johnson';
>'Eric Barton'; 'Nikita Danilov'; lustre-devel at lists.lustre.org
>Subject: Re: [Lustre-devel] Moving forward on Quotas
>
>Here is some more guidance for thinking about the Lustre quota design:
>
>Adaptive qunits are great, but all I see is kind of a hack attempting to get
>this right instead of a good design. Here are some use cases you need to
>address, and hopefully address with existing infrastructure.
>
>(A) You need callbacks to change it, so that when it shrinks clients can
>give up quota.

All remaining quota on the quota slaves (OSTs etc.) must be kept between
[0.5 qunit, 1.5 qunit]. When a slave receives a shrunken qunit, it checks
whether its remaining local quota still satisfies this limitation; if not, it
releases some quota (a rough sketch of this check follows after this message).
Since we cannot predict which OSTs a user will write to, we give a qunit to
every OST at the beginning so that it is available when quota-related writes
arrive later. Therefore, when we want to shrink qunit, we have to shrink it on
all OSTs.

>
>(B) mechanisms to recover the correct value if a client reconnects, or
>master reboots.
>
>Starting from a hard coded default value is wrong. If it's global, then
>you'd need to store this in the configuration log so that it can be re-read
>and managed when it changes, using the config log.

Whenever a quota request arrives, the quota master (MDS) recalculates the
qunit to decide whether to enlarge or shrink it to a proper value (this
computation is simple and cheap). After rebooting, the MDS knows the proper
qunit as soon as the first quota request has been processed; since the current
qunit is carried in the reply of each quota request, an OST learns the current
qunit once one of its quota requests completes. In this way, whether a client
reconnects or the master reboots, the proper quota information gradually
spreads over the whole cluster.
Currently the qunit information is recorded only in memory, not on disk; after
a reboot or reconnection it is recovered at run time. We could certainly
record it in the config log, which would make reconnection and reboot easier,
but there may be hundreds or thousands of quota users/groups in the system and
we would have to flush this information to the log repeatedly because it is
dynamic. There are gains and losses; please advise whether it is worthwhile.

>
>If it is a per user qunit then we may need an entirely new, similar
>mechanism. It probably is, and this is what worries me - it's a huge amount
>of work to get this right.

Yeah, the adaptive qunit is per user and per group.

>Doing this is a LOT of work, and unless you do it right the implementation
>will see a similar pattern of problems with customers as the previous one.
>
>So I want to continue to challenge you by asking if there isn't a quota
>solution that doesn't require adaptive behavior, at the expense of small
>amounts of unmanaged space.

I guess we need to borrow the mechanism from the ldlm to achieve that, as
Johann said before.

tianzy
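Here is the slave-side check sketched in C under the [0.5 qunit, 1.5 qunit] rule above; dqrel_to_master() is an invented stand-in for the actual release RPC:

    #include <stdint.h>

    extern void dqrel_to_master(uint64_t bytes);   /* hypothetical release RPC */

    /* after the master broadcasts a smaller qunit, drop the locally held,
     * unused quota back into the [0.5 * qunit, 1.5 * qunit] window */
    static void slave_check_qunit(uint64_t local_free, uint64_t new_qunit)
    {
            uint64_t high = new_qunit + new_qunit / 2;    /* 1.5 * qunit */

            if (local_free > high) {
                    /* release whole qunits; the remainder stays above
                     * 0.5 * qunit by construction */
                    uint64_t excess  = local_free - high;
                    uint64_t release = ((excess + new_qunit - 1) / new_qunit) *
                                       new_qunit;

                    dqrel_to_master(release);
            }
    }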
On Mon, Jun 02, 2008 at 05:24:33PM -0600, Andreas Dilger wrote:> Just a note - it isn''t only shrinking of qunits that is possible, but also > growth of qunits. I think there was also work done to allow recall of > qunits from the servers, but I''m not sure if it was landed into CVS.Yes, this is included in the adaptive qunit patch. When qunit is shrunk, the new value is broadcasted to the slaves which release the unused qunits.> If we are significantly re-architecting quotas, I''d suggest that we also > re-implement grants at the same time and use the DLM to do both of them. > This way we can have grant + quota on a per-file basis (quota + grant are > given to clients as part of extent lock LVB), and are also able to > recall quota + grant. We may not even want to have separate quota+grant, > since we track the ownership of files on the OSTs and space allocation > is done on a per-file basis. > > It would be possible, for example, to take a user''s whole quota from > the master, split it evenly into "num_osts * 2" chunks at mount time > to pass to the OSTs, they further grant it to clients when they request > extent locks, and then avoid ALL master->OST quota RPCs unless that user > actually got close to exceeding their quota, either granting out some > of the remaining "num_osts" qunits or recalling some of the outstanding > quota (possibly via lock "conversion" to avoid revoking the quota lock).I''ve been in favor of such an architecture too (see my previous emails). The only "problem" is that it requires quite a lot of work. Johann
Nikita Danilov wrote:> Matthew Ahrens writes: > > Nikita Danilov wrote: > > > Jeff Bonwick writes: > > > > I''d suggest working with Matt Ahrens on this. > > > > > > Hello, > > > > > > we were discussing recently what is needed from the DMU to implement quotas > > > and other forms of space accounting. Our basic premise is that it is desirable > > > to keep DMU part of the quota support at minimum, and to implement only > > > mechanism here, leaving policy to the upper layers. > > > > I agree with this premise. However, your proposed implementation (especially > > the asynchronous update mechanism and associated pending file) seems > > unnecessarily complicated. > > > > I would suggest that we simply update a "database" (eg. ZAP object or sparse > > array) of userid -> space usage from syncing context when the space is > > allocated/freed (ie, dsl_dataset_block_{born,kill}). I believe that the > > problems this presents[*] will be more tractable than the method you outlined. > > Indeed, this solution is much simpler, and it was considered > initially. I see following drawbacks in it:Agreed, those are possible drawbacks, depending on the implementation. For example, if the DB object is stored in the user''s objset (which is preferable for other reasons) then I suspect that the two drawbacks you mention below will be no worse than in your proposal. --matt> - a notion of a user identifier (or some opaque identifier) has to > be introduced in DMU interface. DMU doesn''t interpret these > identifiers in any way, except for using them as keys in a space > usage database. A set of these identifiers has to be passed to > every DMU entry point that might result in space allocation (a > set is needed because there are group quotas, and to keep > interface more or less generic). > > - an implementation of chown, chgrp, and distributed quota require > DMU user to modify this database. Also, an interface to iterate > over this database is most likely needed for things like > distributed fsck, and user level quote reporting tools. It seems > that it would be quite difficult to encapsulate such a database > within DMU. > > > > > --matt > > > > [*] eg, if the DB object is stored in the user''s objset, then updating it in > > syncing context may be problematic. if it is stored in the MOS, carrying it > > The proposal was to update the database in the context of currently open > transaction group. That is, when transaction group T has just committed, > commit call-back is invoked and the database is updated in the context > of some transaction belonging to transaction group T + 2 (T + 1 being in > sync). It is because of this that pending file has to keep track of > objects from two last transaction groups. > > > along when doing snapshot operations will be painful (snapshot, clone, send, > > recv, rollback, etc). > > Nikita.
Quotas are determined by the owner of the file, not by the uid of the process.
We have been propagating the owner uid / gid of files from MDS to OSS for this
reason.

I don't understand the issue.

Peter

>
> I've discussed with Fanyong and imo, the uid mapping issue is clearly a
> showstopper for now. On HEAD, the client-side uid/gid packed by GSS is
> translated to a service-side uid/gid. This mapping exists on the MDS, but not
> on the OSSs. The problem is that for quotas, OSSs must be aware of this
> mapping.
> We should find a solution to this problem (not straightforward) which needs to
> be addressed anyway even for DMU support.
>
> In general, I don't think that much of this work will be "throw away".
>
> Cheers,
> Johann
On Thu, Jun 05, 2008 at 04:09:41AM -0700, Peter Braam wrote:> Quota are determined by the owner of the file, not by the uid of the > process.In my understanding, the client is not aware of the server-side uid/gid and the file owner returned to the client is translated too. Please let me know if I misunderstood how the uid/gid mapping feature works.> We have been propagating owners / gid of files from MDS to OSS for this > reason. > > I don''t understand the issue.The problem is that the owner / group is propagated to the OSTs through the client the first time it sends a bulk write rpc (by setting the OBD_MD_FLUID/GID flags) and those identifiers - supplied by the client - must be translated too. Johann
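For reference, the propagation Johann mentions amounts to something like the fragment below on the client write path; OBD_MD_FLUID/OBD_MD_FLGID and the obdo fields are the names used in this thread, but the wrapper itself is a simplified illustration (assuming the usual Lustre headers), not the actual osc code:

    /* assumes the Lustre headers that define struct obdo and the
     * OBD_MD_FL* flags; simplified illustration only */
    static void pack_object_owner(struct obdo *oa, __u32 uid, __u32 gid)
    {
            oa->o_uid    = uid;
            oa->o_gid    = gid;
            oa->o_valid |= OBD_MD_FLUID | OBD_MD_FLGID;
            /* with a remote client these are still client-side ids, which is
             * exactly the translation gap discussed in this thread */
    }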
Well, this protocol hasn''t been published yet; why not include the server side uid / gid then? But more seriously, how is this encoded in such a way that the OST can trust the information - it must be in the capability too? Peter On 6/5/08 5:27 AM, "Johann Lombardi" <johann at sun.com> wrote:> On Thu, Jun 05, 2008 at 04:09:41AM -0700, Peter Braam wrote: >> Quota are determined by the owner of the file, not by the uid of the >> process. > > In my understanding, the client is not aware of the server-side uid/gid and > the file owner returned to the client is translated too. Please let me know > if I misunderstood how the uid/gid mapping feature works. > >> We have been propagating owners / gid of files from MDS to OSS for this >> reason. >> >> I don''t understand the issue. > > The problem is that the owner / group is propagated to the OSTs through the > client the first time it sends a bulk write rpc (by setting the > OBD_MD_FLUID/GID flags) and those identifiers - supplied by the client - > must be translated too. > > Johann > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote:> Well, this protocol hasn''t been published yet; why not include the server > side uid / gid then?I understood from Fanyong that according to the remote client design requirements, we should not allow the remote client to access the server-side uid/gid mapping.> But more seriously, how is this encoded in such a way that the OST can trust > the information - it must be in the capability too?hmm, I don''t see any capability associated with this. Johann
On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote:> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >> Well, this protocol hasn''t been published yet; why not include the server >> side uid / gid then? > > I understood from Fanyong that according to the remote client design > requirements, we should not allow the remote client to access the server-side > uid/gid mapping.You have two choices: break that rule OR let the MDS server do the ownership changes. It''s not complicated, just make a choice.> >> But more seriously, how is this encoded in such a way that the OST can trust >> the information - it must be in the capability too? > > hmm, I don''t see any capability associated with this.We had better find it, otherwise there is a security hole.> > Johann > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Peter Braam ??:> > On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: > > >> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >> >>> Well, this protocol hasn''t been published yet; why not include the server >>> side uid / gid then? >>> >> I understood from Fanyong that according to the remote client design >> requirements, we should not allow the remote client to access the server-side >> uid/gid mapping. >> > > You have two choices: break that rule OR let the MDS server do the ownership > changes. It''s not complicated, just make a choice. > > >>> But more seriously, how is this encoded in such a way that the OST can trust >>> the information - it must be in the capability too? >>> >> hmm, I don''t see any capability associated with this. >> > > We had better find it, otherwise there is a security hole. > > >I think we can divide clients into two sorts: trusted and untrusted. The client reliability is defined by administrator. Remote client should be counted as untrusted one. The best simple way is that: local client is trusted, remote client is untrusted. For the trusted ones, disable capability, OSS set file "uid" and "gid" with "oa.o_uid" and "oa.o_gid". It is the current using means. For the untrusted ones, enable capability, OSS set file "uid" and "gid" which contain in the OSS capability got from MDS when open. With OSS capability enabled will affect performance, and current capability''s design does not contain the consideration for that. We can fix the OSS capability design in the task "o3_se_capa_review" Note: since capability feature is time-consuming, we want to support enforcing capabilities on selected clients (or somewhat MDS/OSS capability should aim at remote client). On he other hand, enforcing capabilities on selected clients can simply the capability interoperability between HEAD and b1_6. If this idea can be approved, then the current remote client uid mapping rules can be unchanged, uid mapping on OST is unnecessary also. Thanks! -- Fan Yong>> Johann >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel >> > > >
Thanks for explaining. The trust concept you outline is definitely not acceptable - we need a capability for all access and modifications done through clients on behalf of a capability. Even changes made by the MDS need to be secured - that can be through a kerberos connection, and again, not through blind trust. Would you please send the capability HLD to me? Thanks! - Peter - On 6/9/08 2:52 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote:> Peter Braam ??: >> >> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >> >> >>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>> >>>> Well, this protocol hasn''t been published yet; why not include the server >>>> side uid / gid then? >>>> >>> I understood from Fanyong that according to the remote client design >>> requirements, we should not allow the remote client to access the >>> server-side >>> uid/gid mapping. >>> >> >> You have two choices: break that rule OR let the MDS server do the ownership >> changes. It''s not complicated, just make a choice. >> >> >>>> But more seriously, how is this encoded in such a way that the OST can >>>> trust >>>> the information - it must be in the capability too? >>>> >>> hmm, I don''t see any capability associated with this. >>> >> >> We had better find it, otherwise there is a security hole. >> >> >> > I think we can divide clients into two sorts: trusted and untrusted. > The client reliability is defined by administrator. Remote client > should be counted as untrusted one. The best simple way is that: > local client is trusted, remote client is untrusted. > > For the trusted ones, disable capability, OSS set file "uid" and "gid" > with "oa.o_uid" and "oa.o_gid". It is the current using means. > For the untrusted ones, enable capability, OSS set file "uid" and "gid" > which contain in the OSS capability got from MDS when open. > With OSS capability enabled will affect performance, and current > capability''s design does not contain the consideration for that. We > can fix the OSS capability design in the task "o3_se_capa_review" > > Note: since capability feature is time-consuming, we want to support > enforcing capabilities on selected clients (or somewhat MDS/OSS > capability should aim at remote client). On he other hand, enforcing > capabilities on selected clients can simply the capability interoperability > between HEAD and b1_6. > > If this idea can be approved, then the current remote client uid mapping > rules can be unchanged, uid mapping on OST is unnecessary also. > > > Thanks! > -- > Fan Yong >>> Johann >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>> >> >> >> >
Peter Braam ??:> Thanks for explaining. > > The trust concept you outline is definitely not acceptable - we need a > capability for all access and modifications done through clients on behalf > of a capability. > > Even changes made by the MDS need to be secured - that can be through a > kerberos connection, and again, not through blind trust. >The client reliability (trusted or untrusted) and the MDS/OSS capabilities are not conflict. They can be combined together as mentioned in draft of new MDS/OSS capabilities HLD. capabilities.lyx is the old HLD capability_hld.lyx is the new HLD which is in patch inspection and fix. Thanks! -- Fan Yong> Would you please send the capability HLD to me? Thanks! > > - Peter - > > > On 6/9/08 2:52 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote: > > >> Peter Braam ??: >> >>> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >>> >>> >>> >>>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>>> >>>> >>>>> Well, this protocol hasn''t been published yet; why not include the server >>>>> side uid / gid then? >>>>> >>>>> >>>> I understood from Fanyong that according to the remote client design >>>> requirements, we should not allow the remote client to access the >>>> server-side >>>> uid/gid mapping. >>>> >>>> >>> You have two choices: break that rule OR let the MDS server do the ownership >>> changes. It''s not complicated, just make a choice. >>> >>> >>> >>>>> But more seriously, how is this encoded in such a way that the OST can >>>>> trust >>>>> the information - it must be in the capability too? >>>>> >>>>> >>>> hmm, I don''t see any capability associated with this. >>>> >>>> >>> We had better find it, otherwise there is a security hole. >>> >>> >>> >>> >> I think we can divide clients into two sorts: trusted and untrusted. >> The client reliability is defined by administrator. Remote client >> should be counted as untrusted one. The best simple way is that: >> local client is trusted, remote client is untrusted. >> >> For the trusted ones, disable capability, OSS set file "uid" and "gid" >> with "oa.o_uid" and "oa.o_gid". It is the current using means. >> For the untrusted ones, enable capability, OSS set file "uid" and "gid" >> which contain in the OSS capability got from MDS when open. >> With OSS capability enabled will affect performance, and current >> capability''s design does not contain the consideration for that. We >> can fix the OSS capability design in the task "o3_se_capa_review" >> >> Note: since capability feature is time-consuming, we want to support >> enforcing capabilities on selected clients (or somewhat MDS/OSS >> capability should aim at remote client). On he other hand, enforcing >> capabilities on selected clients can simply the capability interoperability >> between HEAD and b1_6. >> >> If this idea can be approved, then the current remote client uid mapping >> rules can be unchanged, uid mapping on OST is unnecessary also. >> >> >> Thanks! >> -- >> Fan Yong >> >>>> Johann >>>> _______________________________________________ >>>> Lustre-devel mailing list >>>> Lustre-devel at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>>> >>>> >>> >>> > > >-------------- next part -------------- A non-text attachment was scrubbed... Name: capabilities.lyx Type: application/x-lyx Size: 8268 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-devel/attachments/20080610/c0ee7ff3/attachment-0002.bin -------------- next part -------------- A non-text attachment was scrubbed... 
Peter Braam wrote:
> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote:
>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote:
>>> Well, this protocol hasn't been published yet; why not include the server
>>> side uid / gid then?
>> I understood from Fanyong that according to the remote client design
>> requirements, we should not allow the remote client to access the server-side
>> uid/gid mapping.
>
> You have two choices: break that rule OR let the MDS server do the ownership
> changes. It's not complicated, just make a choice.

1) The MDS changes the file owner of the objects on the OSS at create time.
   This introduces one additional RPC between MDS and OSS for each object,
   which makes object pre-creation meaningless. We probably cannot accept the
   creation performance in that case.
   One improvement: initialize all pre-created objects on the OSS with the
   same uid/gid as the file currently being created. Subsequent creations
   check the pre-created object's uid/gid; if they match the owner's uid/gid,
   no additional setattr to the OSS is needed, otherwise a chown RPC is sent
   to the OSS (a rough sketch of this check follows after this message).
   The best case is equal to the current implementation; the worst case is
   equal to having no such improvement. In creation-performance-sensitive
   workloads (HPC or benchmark tests), there are few uid/gid context switches,
   so this improvement should satisfy most creation performance requirements.
2) Pack the file owner into the OSS capability when writing to the OSS.
   According to our UID mapping rules, we do not want the client (especially
   an untrusted client, e.g. a remote client) to know which server-side
   uid/gid it is mapped to. That is why dynamic UID mapping is used and some
   complex remote ACL operations were implemented.
   So if the real file owner (server-side uid/gid) has to be packed into the
   OSS capability returned to the client, it had better be encrypted to
   respect our UID mapping rules; otherwise most of our earlier effort is
   broken. On the other hand, MDS/OSS capabilities use HMAC-SHA1 to prevent
   them from being modified or fabricated. Signing the MAC is already somewhat
   time-consuming and uid/gid encryption is similar, so performance would
   become worse. Implementing server-side uid/gid encryption and decryption
   also requires designing a key-update mechanism.
   So I do not think this is a good choice.
3) Establish the UID mapping on the OSS as well, just as the MDS does.
   This follows the same principle as on the MDS, which needs lower-layer GSS
   support. Much code could be shared between MDS and OSS for UID mapping.
   It is the most complete solution, with little effect on creation
   performance or the UID mapping rules. The shortcoming is substantial
   changes on the OSS; how much GSS effort that requires has not been well
   estimated yet.

>>> But more seriously, how is this encoded in such a way that the OST can trust
>>> the information - it must be in the capability too?
>> hmm, I don't see any capability associated with this.
>
> We had better find it, otherwise there is a security hole.

The root user's super permission should also be considered for the OSS
capability. On the client side, an open file descriptor can be shared by
different threads, and different threads may have different uids. Consider
the following case: the root user writes a file through a file descriptor
opened by another, normal user (Tom). The root user can break through Tom's
quota limitation; he should tell the OSS that he is root, but how can the OSS
believe him? Such information should be contained in the OSS capability.
(Note: on Lustre 1.6, llite sets "OBD_BRW_NOQUOTA" directly for this, which
can be fabricated by a malicious client.)

Best Regards!
--
Fan Yong

> Johann
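A small sketch of the owner check behind option (1); every name below (ost_object, ost_setattr_owner, mds_fix_object_owner) is hypothetical:

    #include <sys/types.h>

    struct ost_object {
            uid_t oo_uid;     /* owner of the pre-created object */
            gid_t oo_gid;
    };

    extern int ost_setattr_owner(struct ost_object *obj, uid_t uid, gid_t gid);

    /* only pay the extra MDS->OSS RPC when the pre-created object's owner
     * differs from the new file's owner */
    static int mds_fix_object_owner(struct ost_object *obj, uid_t uid, gid_t gid)
    {
            if (obj->oo_uid == uid && obj->oo_gid == gid)
                    return 0;                        /* best case: no RPC */

            return ost_setattr_owner(obj, uid, gid); /* worst case: one chown RPC */
    }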
I did not read the entire long message, but clearly the importance of authorizing operations on the OSS far exceeds that of exposing a uid to a client. Moreover, it doesn''t have to be exposed. If the capability contains the uid it can be encrypted with a key so that only the OSS can see it. This seems the preferred solution. Peter On 6/10/08 7:54 AM, "Yong Fan" <Yong.Fan at Sun.COM> wrote:> Peter Braam ??: >> >> On 6/6/08 1:33 AM, "Johann Lombardi" <johann at sun.com> wrote: >> >> >>> On Thu, Jun 05, 2008 at 06:45:36AM -0700, Peter Braam wrote: >>> >>>> Well, this protocol hasn''t been published yet; why not include the server >>>> side uid / gid then? >>>> >>> I understood from Fanyong that according to the remote client design >>> requirements, we should not allow the remote client to access the >>> server-side >>> uid/gid mapping. >>> >> >> You have two choices: break that rule OR let the MDS server do the ownership >> changes. It''s not complicated, just make a choice. >> > 1) MDS changes file owner for objects on OSS when create. > It introduces one additional RPC between MDS and OSS > for each objects, which causes the per-create meaningless. > Maybe we can not accept the creation performance undre > such case. > An improvement for that is that we can initialize all > the pre-create objects on OSS with the same uid/gid as > the current created file. And the succedent creations > check the per-created object''s uid/gid, if they match > the owner''s uid/gid, then the additional setattr to OSS > is unnecessary, otherwise, chown to OSS will be sent. > The best case is equal to the current implementation; > the worst case is equal to no such improvement. > Under creation performance sensitivity case, maybe HPC > or some benchmark test. There are few uid/gid context > switch. So I think such improvement can match most > creation performance requirement. > 2) Pack file owner into OSS capability when write to OSS. > According to our UID mapping rules, we do not want the > client (especially the untrusted client, maybe remote > client) to know which the server-side uid/gid are mapped. > For that, dynamic UID mapping are used, and some complex > remote acl operations are implemented. > So if file real owner (server-side uid/gid) need to be > packed in OSS capability back to client, bad better to > encrypt them to match our UID mapping rules. Otherwise, > most of our effort for that before are broken. On the > other hand, MDS/OSS capabilities use HMAC-SHA1 to prevent > from being modified or fabricated. Signing MAC is somewhat > time-consuming, and uid/gid encryption are similar, so the > performance will become worse. To implement server-side > uid/gid encryption and decryption, need to design some > key-update mechanism also. > So I do not think it is a good choice. > 3) Establish UID mapping on OSS also, just like MDS does. > It has the same principle as done on MDS which needs > lower layer GSS support. Many codes can be shared > between MDS and OSS for UID mapping. It is the most > complete solution, little affect on creation performance > and UID mapping rules. The shortcoming is much changes > on OSS, how much GSS effort for that is not well estimated > yet. >> >>>> But more seriously, how is this encoded in such a way that the OST can >>>> trust >>>> the information - it must be in the capability too? >>>> >>> hmm, I don''t see any capability associated with this. >>> >> >> We had better find it, otherwise there is a security hole. 
>> >> > Root user''s super permission should be considered for OSS > capability. On client-side, the opened file descriptor can > be shared by different threads, and the different threads > maybe have different "uid". Consider the following case: > Root user writes a file through other normal user (Tom) > opened file descriptor. The root user can breaks through > Tom''s quota limitation, he should tell OSS that he is a > root user, but how can the OSS believe him? Such information > should be contained in the OSS capability. > (Note: on lustre 1.6, llite set "OBD_BRW_NOQUOTA" directly > for that, which can be fabricated by baleful client) > > Best Regards! > -- > Fan Yong >> >>> Johann >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>> >> >> >> > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel