Peter J. Braam
2004-Jun-02 23:15 UTC
[lustre-devel] RE: [PATCH/RFC] Lustre VFS patch, version 2
Hello! The feedback on the Lustre patches was of very high quality; thanks a lot for studying them carefully. Things are simpler now. Oleg Drokin and I discussed the emails extensively and here is our reply. We have attached another collection of patches, addressing many of the concerns. We felt it was perhaps easier to keep this all in one long email.

People requested to see the code that uses the patch. We have uploaded that to:

ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz

The client file system is the primary user of the kernel patch, in the llite directory. The MDS server is a sample user of do_kern_mount. As requested I have removed many other things from the tar ball to make review simple (so this won't compile or run).

1. Exported-symbol concerns by Christoph Hellwig:

Indeed we can do without __iget, kernel_text_address, reparent_to_init and exit_files. We actually need do_kern_mount and truncate_complete_page. do_kern_mount is used because we use a file system namespace in servers in the kernel without exporting it to user space (mds/handler.c). The server file systems are ext3 file systems, but we replace VFS locking with DLM locks, and it would take considerable work to export that as a file system. truncate_complete_page is used to remove pages in the middle of a file mapping when lock revocations happen (llite/file.c, ll_extent_lock_callback, calling ll_pgcache_remove_extent).

2. lustre_version.patch concerns by Christoph Hellwig:

This one can easily be removed, but the kernel version alone does not necessarily represent anything useful. There are tons of people patching their kernels, even applying parts of newer kernels and still leaving the kernel version at its old value (distributions immediately come to mind). So we still need something to identify the version of the necessary bits, e.g. the version of the intent API.

3.
Introduction of a lock-less version of d_rehash (__d_rehash) by Christoph Hellwig:

In some places Lustre needs to do several things to dentries with the dcache lock already held, e.g. traverse the alias dentries of an inode to find one with the same name and parent as the one we already have. Lustre can invalidate busy dentries, which we put on a list. If these are looked up again, concurrently, we find them on this list and re-use them, to avoid having several identical aliases in an inode. See llite/{dcache.c,namei.c} ll_revalidate and the lock callback function ll_mdc_blocking_ast, which calls ll_unhash_aliases. We use d_move to manipulate dentries associated with raw inodes and names in ext3.

4. VFS intent API changes to a kernel-exported API, concern by Christoph Hellwig:

With slight modification it is possible to reduce the changes to just renames of the intent structure itself and some of its fields. This renaming was requested by Linus, but we can change the names back easily if needed; that would avoid any API change. If there are other users, please let us know what to do.

All the functions can easily be split into ones that expect a valid intent (with some suffix in the name, like _it) and ones that are part of the old API; the latter would just initialise the intent to something sensible and then call the corresponding intent-expecting function. No harm should be done to external filesystems this way. We have modified the VFS intent API patch to achieve this.

5. Some objections from Trond Myklebust about open flags in exec, cwd revalidation, and the revalidate_counter patch:

We have fixed the exec open-flags issue (our error). The revalidate_counter patch was dropped, since we can do this inside Lustre as well. CWD revalidation can be converted to FS_REVAL_DOT in the fs flags instead, but we still need part of that patch, the LOOKUP_LAST/LOOKUP_NOT_LAST part. Lustre needs to know when we have reached the last component in the path, so that the intent can be looked at.
(It seems we cannot use LOOKUP_CONTINUE for this reliably.)

6. From Trond Myklebust:

> The vfs-intent_lustre-vanilla-2.6.patch + the "intent_release()"
> code. What if you end up crossing a mountpoint? How do you then know
> to which superblock/filesystem the private field belongs if there are
> more than one user of this mechanism?

Basically the intent only makes sense for the last component. Our code checks that, and if we are looking up a component before the last, a dummy IT_LOOKUP intent is created on the stack and we work with that; perhaps the same is true for other filesystems that would like to use this mechanism.

7. Raw-operations concerns by various people:

We have now implemented an alternative approach to this, which takes place when the parent lookup is done, using intents. For setattr we managed to remove the raw operations altogether (praying that we haven't forgotten some awful problem we solved that led to the introduction of setattr_raw in the first place).

The correctly filled intent is recognised by the filesystem's lookup or revalidate method. After the parent is looked up, based on the intent the correct "raw" server call is executed, within the file system. Then a special flag is set in the intent; the caller of the parent lookup checks for the flag, and if it is set, the function returns immediately with the exit code supplied in the intent, without instantiating child dentries.

This needs some minor changes to the VFS, though. There are at least two approaches. One is to not introduce any new methods and just rely on the fs's methods to do everything; for this to work the filesystem needs to know the remaining path to be traversed (we can fill nd->last with the remaining path before calling into the fs). In the root directory of the mount, we need to call a revalidate (if supported by the fs) on the mountpoint to intercept the intent, after we have crossed the mountpoint. We have this approach implemented in the attached patch. Does it look better than the raw operations?
Much simpler for us would be to add an additional inode operation, a "process_intent" method, called when a LOOKUP_PARENT sort of lookup was requested and we are about to leave link_path_walk() with the nameidata structure filled and everything ready. Then the same flag in the intent would be set, and everything else would work as in the previous approach. We believe both methods are less intrusive than the raw methods, but slightly more delicate.

8. Mountpoint-crossing issues during rename (and link), noticed by Arjan van de Ven:

Well, indeed this can happen if the source or destination is a mountpoint on the client but not on the server; this needs to be addressed by the individual filesystems that chose to implement those raw methods.

9. dev_readonly patch concerns by Jens Axboe:

We already clarified why we need it in this exact way. But there were some valid suggestions to use other means, like the dm-flakey device-mapper module, so we decided to write a failure-simulator DM.

10. "Have these patches undergone any significant test?" by Anton Blanchard:

There are two important questions, I think:

- Do the patches cause damage? Probably not anymore. SUSE has done testing and it appears the original patch I attached didn't break things (after one fix was made).

- Is Lustre stable? On 2.4 Lustre is quite stable. On 2.6 we have done testing but, for example, never on more than 40 nodes. We don't consider it rock solid on 2.6; it does pass POSIX and just about every other benchmark without failures. Since the patches were modified for this discussion there are of course some new issues, which Oleg Drokin is now ironing out. Our test results are visible at https://buffalo.lustre.org

Well, how close are we now to this being acceptable?

- Peter J. Braam & Oleg Drokin -
Christoph Hellwig
2004-Jun-03 13:59 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On Wed, Jun 02, 2004 at 05:15:27PM -0600, Peter J. Braam wrote:

> People requested to see the code that uses the patch. We have uploaded that to:
>
> ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz
>
> The client file system is the primary user of the kernel patch, in the
> llite directory. The MDS server is a sample user of do_kern_mount. As
> requested I have removed many other things from the tar ball to make
> review simple (so this won't compile or run).

> We actually need do_kern_mount and truncate_complete_page. do_kern_mount
> is used because we use a file system namespace in servers in the
> kernel without exporting it to user space (mds/handler.c). The server
> file systems are ext3 file systems but we replace VFS locking with DLM
> locks, and it would take considerable work to export that as a file
> system.

Yikes. I'd rather not see something like this going in; better to work on properly integrating the MDS code with the filesystem. There's also lots of duplication or near-duplication of VFS functionality in that directory, and the fsfilter horrors. I'd suggest you get that cleaned up and we'll try to merge it into 2.7, okay?

> truncate_complete_page is used to remove pages in the middle of a file
> mapping, when lock revocations happen (llite/file.c
> ll_extent_lock_callback, calling ll_pgcache_remove_extent).

Most of ll_pgcache_remove_extent probably wants to be a proper VFS function. Again, only interesting if the rest of Lustre gets merged.

> 2. lustre_version.patch concerns by Christoph Hellwig:
>
> This one can easily be removed, but the kernel version alone does not
> necessarily represent anything useful. There are tons of people
> patching their kernels, even applying parts of newer kernels
> and still leaving the kernel version at its old value
> (distributions immediately come to mind). So we still need something
> to identify the version of the necessary bits, e.g. the version of
> the intent API.

Well, bad luck for you.
It's not like there is much interest in merging any of these patches into the tree without the actual users anyway.

> 3. Introduction of a lock-less version of d_rehash (__d_rehash) by
> Christoph Hellwig:
>
> In some places Lustre needs to do several things to dentries with the
> dcache lock already held, e.g. traverse the alias dentries of an inode to
> find one with the same name and parent as the one we already have. Lustre
> can invalidate busy dentries, which we put on a list. If these are
> looked up again, concurrently, we find them on this list and re-use
> them, to avoid having several identical aliases in an inode. See
> llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
> ll_mdc_blocking_ast, which calls ll_unhash_aliases. We use d_move to
> manipulate dentries associated with raw inodes and names in ext3.

I've only taken a short look at the dcache operations you're doing, and it looks a little fishy and very sensitive to small changes in internal dcache semantics. You're also missing e.g. the LSM callbacks, it seems. Have you talked to Al about that code?

> 4. VFS intent API changes to a kernel-exported API, concern by Christoph
> Hellwig:
>
> With slight modification it is possible to reduce the changes to just
> renames of the intent structure itself and some of its fields.
>
> This renaming was requested by Linus, but we can change the names back
> easily if needed; that would avoid any API change. If there are other
> users, please let us know what to do.

Again, you're changing a filesystem API; we have a bunch of in-tree users that can be modular, so it's likely there are out-of-tree users, too. The new semantics might be much nicer, but it's 2.7 material.

> 7. Raw-operations concerns by various people:
>
> We have now implemented an alternative approach to this, which takes
> place when the parent lookup is done, using intents.
> For setattr we managed to remove the raw operations altogether
> (praying that we haven't forgotten some awful problem we solved that
> led to the introduction of setattr_raw in the first place).
>
> The correctly filled intent is recognised by the filesystem's lookup or
> revalidate method. After the parent is looked up, based on the intent
> the correct "raw" server call is executed, within the file system.
> Then a special flag is set in the intent; the caller of the parent
> lookup checks for the flag, and if it is set, the function returns
> immediately with the exit code supplied in the intent, without
> instantiating child dentries.
>
> This needs some minor changes to the VFS, though. There are at
> least two approaches.
>
> One is to not introduce any new methods and just rely on the fs's
> methods to do everything; for this to work the filesystem needs to know
> the remaining path to be traversed (we can fill nd->last with the
> remaining path before calling into the fs). In the root directory of
> the mount, we need to call a revalidate (if supported by the fs) on the
> mountpoint to intercept the intent, after we have crossed the
> mountpoint. We have this approach implemented in the attached patch.
> Does it look better than the raw operations?

I'm not sure whether overloading ->d_revalidate or a new method for that is preferable.

> Much simpler for us would be to add an additional inode operation,
> a "process_intent" method, called when a LOOKUP_PARENT sort of lookup
> was requested and we are about to leave link_path_walk() with the
> nameidata structure filled and everything ready. Then the same flag
> in the intent would be set and everything else would work as in the
> previous approach.

Yupp, that sounds better.

> Well, how close are we now to this being acceptable?

As already mentioned above, they're completely uninteresting without actually getting the user in tree _and_ maintained there (unlike e.g. intermezzo or coda, which are creeping along).
I think based on those patches we should be able to properly integrate intermezzo once 2.7 opens.
Lars Marowsky-Bree
2004-Jun-03 14:19 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On 2004-06-03T14:59:52, Christoph Hellwig <hch@infradead.org> said:

> > Well, how close are we now to this being acceptable?
> As already mentioned above they're completely uninteresting without
> actually getting the user in tree _and_ maintained there (unlike e.g.
> intermezzo or coda that are creeping along). I think based on those
> patch we should be able to properly integrate intermezzo once 2.7 opens.

This is something I've got to disagree with.

First, InterMezzo is reasonably dead, from what I can see. As is Coda. You'll notice that the developers behind them have sort-of moved on to Lustre ;-)

The hooks (once cleaned up, no disagreement here; the technical feedback so far has been very valuable and continues to be) are useful and in effect needed not just for Lustre, but in principle for all cluster filesystems, such as (Open)GFS and others, even potentially NFS4 et al.

The logic that _all_ modules and functionality need to be "in the tree" right from the start for hooks to be useful is flawed, I'm afraid. Pure horror that a proprietary cluster file system might also profit from it is not, exactly, a sound technical argument. (I can assure you I don't care at all for the proprietary cluster-fs.)

Lustre alone would be, roughly, ~10MB more sources, just in the kernel. I don't think you want to merge that right now, as desirable as it is on the other hand to be able to use it with a mainstream kernel. I think this is why kbuild allows external modules to be built; by that logic it would follow that this should be disabled too.

There certainly is an interest in merging these (cleaned up) extensions and allowing powerful cluster filesystems to exist on Linux.

Another example of this is the cache-invalidation hook which we went through a few weeks ago too. Back then you complained about not having an Open Source user (because it was requested by IBM GPFS), and so GFS/OpenGFS chimed in; now it is the lack of an _in-tree_ Open Source user...
Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering        \ ever tried. ever failed. no matter.
SUSE Labs                             | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett
Christoph Hellwig
2004-Jun-03 14:26 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:

> On 2004-06-03T14:59:52, Christoph Hellwig <hch@infradead.org> said:
>
> > > Well, how close are we now to this being acceptable?
> > As already mentioned above they're completely uninteresting without
> > actually getting the user in tree _and_ maintained there (unlike e.g.
> > intermezzo or coda that are creeping along). I think based on those
> > patch we should be able to properly integrate intermezzo once 2.7 opens.
>
> This is something I've got to disagree with.
>
> First, InterMezzo is reasonably dead, from what I can see. As is Coda.
> You'll notice that the developers behind them have sort-of moved on to
> Lustre ;-)

Arggg, sorry. Typo there. It should of course have read "I think based on those patches we should be able to properly integrate LUSTRE once 2.7 opens".

.oO(/me looks for a brown paper bag to hide in)

> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)

It's more about maintenance overhead. Maintaining features without the user directly at hand isn't going anywhere, especially when messing around deeply in the VFS. By your argument we should also throw in all the Mosix and OpenSSI hooks because they could possibly be useful, no? ;-)

> Another example of this is the cache invalidation hook which we went
> through a few weeks ago too. Back then you complained about not having
> an Open Source user (because it was requested by IBM GPFS), and so
> GFS/OpenGFS chimed in - now it is the lack of an _in-tree_ Open Source
> user...

I was mostly arguing against the lack of an in-tree user all along. Lack of something that we could merge even in the future is even worse.
Trond Myklebust
2004-Jun-03 14:49 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On Thu, 03/06/2004 at 07:19, Lars Marowsky-Bree wrote:

> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.
>
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)

Whereas I agree that NFSv4 could use some of this (I'm mainly interested in the intent_release() stuff in order to fix up an existing race), I also agree with Christoph on the principle that having in-tree users right from the start should be the norm rather than the exception.

Otherwise, exactly what is the plan for determining when an interface is obsolete? Are we going to rely on all the out-of-tree vendors to collectively step up and say "by the way - we're not using this anymore"?

Cheers,
Trond
Daniel Phillips
2004-Jun-04 05:03 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On Thursday 03 June 2004 10:19, Lars Marowsky-Bree wrote:

> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.

GFS is now down to needing two trivial patches:

  1) export sync_inodes_sb
  2) provide a filesystem hook for flock

Since GFS functions well without any of the current batch of proposed vfs hooks, the word "needed" is not appropriate. Maybe there is something in here that could benefit GFS, most probably in the intents department, but we certainly do want to try it first before pronouncing on that.

The raw_ops seem to be entirely irrelevant to GFS, which is peer-to-peer, so it does not delegate anything to a server. I don't think we have a use for lookup_last. There are quite possibly some helpful ideas in the dcache tweaks, but the devil is in the details: again, we need to try it. Such things as:

  +#define DCACHE_LUSTRE_INVALID 0x0020 /* invalidated by Lustre */

clearly fail the "needed not just for Lustre" test.

Looking into my crystal ball, I see many further revisions of this patch set. Unfortunately, in the latest revision we lost the patch-by-patch discussion, which seems to have been replaced by a list of issues sorted by complainant. That's interesting, but it's no substitute.

Regards,

Daniel
Anton Blanchard
2004-Jun-04 16:55 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
> 10. "Have these patches undergone any significant test?" by Anton Blanchard:
>
> There are two important questions I think:
> - Do the patches cause damage?
> Probably not anymore. SUSE has done testing and it appears the
> original patch I attached didn't break things (after one fix was
> made).

IBM did a lot of the work on that issue, and it took the better part of a week to find, fix and verify.

Anton
Dipankar Sarma
2004-Jun-07 18:02 UTC
[lustre-devel] Re: [PATCH/RFC] Lustre VFS patch, version 2
On Sat, Jun 05, 2004 at 02:55:48AM +1000, Anton Blanchard wrote:

> > 10. "Have these patches undergone any significant test?" by Anton Blanchard:
> >
> > There are two important questions I think:
> > - Do the patches cause damage?
> > Probably not anymore. SUSE has done testing and it appears the
> > original patch I attached didn't break things (after one fix was
> > made).
>
> IBM did a lot of the work on that issue and it took the better part of a
> week to find, fix and verify.

AFAIK, Maneesh asked about revalidate_special() returning negative dentries and the lack of checking for that in path lookup, but got no reply from the Lustre folks. It is clearly broken. Maneesh has found more breakage from the Lustre VFS patches now. It would be helpful if they at least commented on fixes for those.

Thanks,
Dipankar