------ Forwarded Message
From: Johann Lombardi <johann at sun.com>
Date: Mon, 26 May 2008 13:35:30 +0200
To: Nikita Danilov <Nikita.Danilov at Sun.COM>
Cc: "Jessica A. Johnson" <Jessica.Johnson at Sun.COM>, Bryon Neitzel <Bryon.Neitzel at Sun.COM>, Eric Barton <eeb at bartonsoftware.com>, Peter Bojanic <Peter.Bojanic at Sun.COM>, <Peter.Braam at Sun.COM>
Subject: Re: Moving forward on Quotas

Hi all,

On Sat, May 24, 2008 at 01:33:36AM +0400, Nikita Danilov wrote:
> I think that to understand to what extent the current quota architecture,
> design, and implementation need to be changed, we have --among other
> things-- to enumerate the problems that the current implementation has.

For the record, I did a quota review with Peter Braam (report attached) back in March.

> It would be very useful to get a list of issues from Johann (and maybe
> Andrew?).

Sure. Actually, there are several aspects to consider:

**************************************************************
* Changes required to quotas because of architecture changes *
**************************************************************

* item #1: Supporting quotas on HEAD (no CMD)

The MDT has been rewritten, but the quota code must be modified to support the new framework. In addition, we were told not to land quota patches on HEAD until this gets fixed (it was a mistake IMHO). So, we also have to port all quota patches from b1_6 to HEAD.
I don't expect this task to take a lot of time since there are no fundamental changes in the quota logic. IIRC, Fanyong is already working on this.

* item #2: Supporting quotas with CMD

The quota master is the only node with a global overview of the quota usages and limits. On b1_6, the quota master is the MDS and the quota slaves are the OSSs. The code is designed in theory to support several MDT slaves too, but some shortcuts have been taken and some additional work is needed to support an architecture with one quota master (one of the MDTs) and several OST/MDT slaves.

* item #3: Supporting quotas with DMU

ZFS does not support standard Unix quotas. Instead, it relies on fileset quotas. This is a problem because Lustre quotas are set on a per-uid/gid basis. To support ZFS, we are going to have to put OST objects in a dataset matching a dataset on the MDS.
We also have to decide what kind of quota interface we want to have at the Lustre level (do we still set quotas on uid/gid, or do we switch to the dataset framework?). Things get more complicated if we want to support an MDS using ldiskfs and OSSs using ZFS (do we have to support this?).
IMHO, in the future, Lustre will want to take advantage of the ZFS space reservation feature, and since this also relies on datasets, I think we should adopt the ZFS quota framework at the Lustre level too.
That being said, my understanding of ZFS quotas is limited to this webpage:
http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch05s06.html
and I haven't had the time to dig further.

****************************************************
* Shortcomings of the current quota implementation *
****************************************************

* issue #1: Performance scalability

Performance with quotas enabled is currently good because a single quota master is powerful enough to process the quota acquire/release requests. However, we know that the quota master is going to become a bottleneck when increasing the number of OSTs.
e.g.: 500 OSTs doing 2GB/s (~tera10) with a quota unit size of 100MB requires 10,000 quota RPCs to be processed by the quota master.
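To make that estimate explicit, here is a rough back-of-the-envelope sketch in Python (reading the example as 2GB/s per OST, which is the reading that yields the 10,000 figure; the function and variable names are purely illustrative, not Lustre code):

    # Rough estimate of the quota-acquire RPC load on a single quota master,
    # assuming one acquire RPC per bunit's worth of data written cluster-wide.
    # Names are illustrative only.

    def quota_rpc_rate(n_osts, per_ost_bw_mb_s, bunit_mb):
        """Acquire RPCs per second seen by the quota master."""
        aggregate_bw = n_osts * per_ost_bw_mb_s   # MB/s written cluster-wide
        return aggregate_bw / bunit_mb            # one acquire per bunit

    # The example from the text: 500 OSTs at 2 GB/s each with a 100 MB bunit.
    print(quota_rpc_rate(500, 2000, 100))    # -> 10000.0 RPCs/s
    print(quota_rpc_rate(500, 2000, 1000))   # -> 1000.0 RPCs/s with a 1 GB bunit

Raising the bunit divides the RPC rate proportionally, which is exactly the trade-off discussed next.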
Of course, we could decide to bump the default bunit, but the drawback is that it increases the granularity of quotas, which is problematic given that quota space cannot be revoked (see issue #2).
Another approach could be to take advantage of CMD and to spread the load across several quota masters. Distributing masters could be done on a per-uid/gid/dataset basis, but we would still hit the same limitation if we want to reach 100+GB/s with a single uid/gid/dataset. More complicated algorithms can also be considered, at the price of increased complexity.

* issue #2: Quota accuracy

When a slave runs out of its local quota, it sends an acquire request to the quota master. As I said earlier, the quota master is the only node with a global overview of what has been granted to the slaves. If the master can satisfy the request, it grants a qunit (which can be a number of blocks or inodes) to the slave. The problem is that an OST can return "quota exceeded" (=EDQUOT) while another OST still has quota left. There is currently no callback to claim back the quota space that has been granted to a slave.
Of course, this hurts quota accuracy and usually disturbs users who are accustomed to using quotas with local filesystems (users do not understand why they are getting EDQUOT while the disk usage is below the limit).
The dynamic qunit patch (bug 10600) has improved the situation by decreasing qunit when the master gets closer to the quota limit, but some cases are still not addressed because there is still no way to claim back quota space granted to the slaves. A simplified model of this acquire/grant protocol is sketched after this message.

* issue #3: Quota overruns

Quotas are handled on the server side, and the problem is that there are currently no interactions between the grant cache and quotas. It means that a client node can continue caching dirty data while the corresponding user is over quota on the server side. When the data are written back, the server is told that the writes have already been acknowledged to the application (by checking whether OBD_BRW_FROM_GRANT is set) and thus accepts the write request even if the user is over quota. The server mentions in the bulk reply that the user is over quota, and the client is then supposed to stop caching dirty data (until the server reports that the user is no longer over quota). The problem is that those quota overruns can be really significant since they depend on the number of clients:
max_quota_overruns = number of OSTs * number of clients * max_dirty_mb
e.g. = 500 * 1,000 * 32
     = 16TB :(
For now, only OSTs are concerned by this problem, but we will have the same problem with inodes when we have a metadata writeback cache.
Fortunately, not every application runs into this problem, but it can happen (actually, it is quite easy to reproduce with IOR/1 file per task).
I've been thinking of 2 approaches to tackle this problem:
- introduce some quota knowledge on the client side and modify the grant cache to take into account the uid/gid/dataset.
- stop granting [0;EOF] locks when a user gets closer to the quota limit and only grant locks covering a region which fits within the remaining quota space. I'm discussing this solution with Oleg atm.

Cheers,
Johann

------ End of Forwarded Message

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Quota.doc
Type: application/octet-stream
Size: 34304 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-devel/attachments/20080527/4b3743ef/attachment-0001.obj
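Here is the minimal model of the acquire/grant protocol referenced in issue #2 above, written in Python with purely illustrative names; it is a sketch of the behaviour described in the mail, not Lustre code:

    # One master tracks a global limit; slaves acquire quota in fixed qunit
    # chunks.  There is deliberately no release/revoke callback, which is the
    # accuracy problem described in issue #2.

    EDQUOT = "quota exceeded"

    class QuotaMaster:
        def __init__(self, limit_mb, qunit_mb):
            self.limit = limit_mb
            self.granted = 0          # total space handed out to slaves
            self.qunit = qunit_mb

        def acquire(self):
            """Grant one qunit if the global limit still allows it."""
            if self.granted + self.qunit > self.limit:
                return 0              # nothing left to grant
            self.granted += self.qunit
            return self.qunit

    class QuotaSlave:                 # e.g. an OST
        def __init__(self, master):
            self.master = master
            self.local = 0            # locally granted, not yet consumed

        def write(self, mb):
            while self.local < mb:
                got = self.master.acquire()
                if got == 0:
                    return EDQUOT     # even if another slave holds unused quota
                self.local += got
            self.local -= mb
            return "ok"

    master = QuotaMaster(limit_mb=300, qunit_mb=100)
    ost1, ost2 = QuotaSlave(master), QuotaSlave(master)
    print(ost1.write(10))    # "ok": ost1 now holds a 100MB qunit it barely uses
    print(ost2.write(250))   # EDQUOT: the master has granted the whole 300MB
                             # although only 10MB have actually been written

A slave that holds granted-but-unused qunits drives the other slaves to EDQUOT well before the global limit is really consumed, which is the user-visible accuracy problem described above.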
------ Forwarded Message
From: Johann Lombardi <johann at sun.com>
Date: Tue, 27 May 2008 00:47:22 +0200
To: Nikita Danilov <Nikita.Danilov at Sun.COM>
Cc: "Jessica A. Johnson" <Jessica.Johnson at Sun.COM>, Bryon Neitzel <Bryon.Neitzel at Sun.COM>, Eric Barton <eeb at bartonsoftware.com>, Peter Bojanic <Peter.Bojanic at Sun.COM>, <Peter.Braam at Sun.COM>
Subject: Re: Moving forward on Quotas

On Mon, May 26, 2008 at 09:56:20PM +0400, Nikita Danilov wrote:
> From reading the quota HLD it is not clear that the master necessarily has
> to be an MDT server.

Indeed, the quota master could, in theory, run on any node. The advantage of the MDS is that it already has a connection to each OST.

> Given that OSTs are going to have MDT-like recovery in 2.0, it seems
> reasonable to hash uid/gid across all OSTs, which would act as masters
> (additionally, it seems logical to handle disk block allocation on OSTs
> rather than MDTs). Or am I missing something here?

It would mean that _each_ OSS has to establish a connection to all the OSTs. Do we already plan to do this in 2.0? I assume that it will probably be required anyway to support other striping patterns like RAID1 or RAID5.
However, my concern is that, with OST pools, we should expect to see more and more configurations with heterogeneous OSSs. As a consequence, some uids/gids could end up with a very responsive quota master (connected to a fast network which can handle 10,000+ RPCs per second), whereas the quota master could become a bottleneck for others, depending on the hash function and the uid/gid number. A small sketch of such a hash-based master assignment follows this message.

> As per the discussion with the ZFS team, they are going to implement
> per-user and per-group block quotas in ZFS (inode quotas make little sense
> for ZFS).

Ah, great.

> Going aside, if I were designing quota from scratch right now, I would
> implement it completely inside of Lustre. All that is needed for such an
> implementation is a set of call-backs that the local file system invokes
> when it allocates/frees blocks (or inodes) for a given object. Lustre would
> use these call-backs to transactionally update local quota in its own
> format. That would save us a lot of the hassle we have dealing with the
> changing kernel quota interfaces, uid re-mappings, and subtle differences
> between quota implementations on different file systems.

This would also mean that we have to rewrite many things that are provided by the kernel today (e.g. quotacheck, a tree to manage per-uid/gid quota data, ...). IMO, we should really weigh the pros and cons before doing so.

> Additionally, this is better aligned with the way we handle access control:
> the MDT implements access control checks completely within MDD, without
> relying on the underlying file system.

OK.

> What strikes me in this description is how similar this is to the DLM. It
> almost looks like quota could be implemented as a special type of lock, and
> the DLM conflict resolution mechanism with cancellation ASTs could be used
> to reclaim quota.

Yes, that's what I think too. I've also been discussing this with Oleg.

Johann

------ End of Forwarded Message
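As mentioned above, a small Python sketch of hashing uid/gid to pick a quota master; the idea of per-id masters comes from the mail, but the CRC32 choice and all names are assumptions made only for illustration:

    import zlib

    def quota_master_for(uid, osts):
        """Pick the OST acting as quota master for a given uid (or gid)."""
        return osts[zlib.crc32(str(uid).encode()) % len(osts)]

    osts = ["OST%04x" % i for i in range(500)]
    print(quota_master_for(0, osts))      # root's quota master
    print(quota_master_for(1042, osts))   # some user's quota master

    # The concern raised above is visible here: the mapping is fixed by the
    # hash, so a heavily-used uid can land on a slow or overloaded OSS no
    # matter how responsive the other candidate masters are.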
Hi all:

Besides what Johann said, the current problems lquota faces are:

Recovery:
1. The current recovery for lquota is to do "lfs quotacheck", which counts all blocks and inodes for all uids/gids. We should have a more elegant method for this. I mean we could do recovery only for the quota requests which haven't been synced between the quota master and the quota slaves (a sketch of this idea follows this message).
2. Customers prefer a fine-grained quotacheck. That means they want something like "lfs quotacheck -u uid/-g gid". It can't be done now because ldiskfs doesn't support this. I hope the DMU will in the future.
3. When quotacheck is running, users face many restrictions, which come from ldiskfs. This is quoted from "man 8 quotacheck":
"It is strongly recommended to run quotacheck with quotas turned off for the filesystem. Otherwise, possible damage or loss to data in the quota files can result. It is also unwise to run quotacheck on a live filesystem as actual usage may change during the scan. To prevent this, quotacheck tries to remount the filesystem read-only before starting the scan. After the scan is done it remounts the filesystem read-write. You can disable this with option -m. You can also make quotacheck ignore the failure to remount the filesystem read-only with option -M."

Zhiyong Landen Tian
Sun Microsystems, Inc.
10/F Chuangxin Plaza, Tsinghua Science Park
Bei Jing 100084 CN
Phone x28801/+86-10-68093801
Email zhiyong.tian at Sun.COM
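Here is the sketch referenced in point 1 above: a hypothetical Python model of replaying only unsynced quota transactions instead of a full quotacheck. The record layout and every name are invented for the example and do not correspond to existing Lustre structures:

    from dataclasses import dataclass

    @dataclass
    class QuotaRecord:
        uid: int
        granted_mb: int
        acked_by_slave: bool   # has the slave acknowledged this grant/release?

    def replay_unsynced(master_log):
        """Re-send only the master/slave quota transactions that were still in
        flight at the time of the failure, instead of recounting every uid/gid
        with a full quotacheck."""
        return [rec for rec in master_log if not rec.acked_by_slave]

    log = [
        QuotaRecord(uid=1000, granted_mb=100, acked_by_slave=True),
        QuotaRecord(uid=1042, granted_mb=100, acked_by_slave=False),
    ]
    print(replay_unsynced(log))   # only uid 1042's grant needs to be replayed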