In conversations with Andreas, he suggested I send some of this to a wider
audience.

Everyone knows by now I'm working on the MacOS X port, right? Okay, good.

As part of that work, I've been working on attribute caching. Right now
I'm hitting a bit of a snag when it comes to caching the file size; unlike
all of the metadata attributes, this requires getting glimpse locks on
the OSTs. And this starts bringing up the large outstanding issue of
the clio layer.

Specifically, clio has ... portability problems. The big one that
crops up right away is the general assumption that OS memory pages are
indexed via a radix tree. This sort of also ties into the way memory
management is handled (I'm not aware of any OS, other than Linux, that
provides a callback that the OS can use if they are experiencing memory
pressure to tell kernel modules to give up pages).

Obviously these aren't new issues. For the first release of the MacOS port,
I didn't use clio at all. But now it's coming up and I'm wondering ...
since Lustre is now being ported to other OSes, what did everyone else
do about these issues?

--Ken
Hello,

On 22 July 2010 19:10, Ken Hornstein <kenh at cmf.nrl.navy.mil> wrote:
> In conversations with Andreas, he suggested I send some of this to a wider
> audience.
>
> Everyone knows by now I'm working on the MacOS X port, right? Okay, good.
>
> As part of that work, I've been working on attribute caching. Right now
> I'm hitting a bit of a snag when it comes to caching the file size; unlike
> all of the metadata attributes, this requires getting glimpse locks on
> the OSTs. And this starts bringing up the large outstanding issue of
> the clio layer.
>
> Specifically, clio has ... portability problems. The big one that
> crops up right away is the general assumption that OS memory pages are
> indexed via a radix tree. This sort of also ties into the way memory

I am not sure I understand this one. CLIO has its own radix tree to index cached pages precisely so that it doesn't depend on whatever indexing mechanism the host kernel uses (see cl_page_find0()). There are two implementations of a radix tree in the public Lustre code: one using the Linux kernel lib/radix-tree.c and another for user space in libcfs/posix/libcfs.h. I assume the Solaris port has another one.

The general assumption of CLIO data page caching is that pages (in files and stripe objects) are indexed by their linear logical offset (page index), that's all. Well, until you look at the direct-IO code paths too closely. :-)

> management is handled (I'm not aware of any OS, other than Linux, that
> provides a callback that the OS can use if they are experiencing memory
> pressure to tell kernel modules to give up pages).

I think any modern UNIX version does this. BSDs and Solaris have the PUT_PAGE() vnop. When I was porting Lustre to OS X (Panther times) there used to be two ways to do this: at the UBC/VFS level and at the UPL/Mach level; I am not sure how things work nowadays.

I am not sure how it is possible to have a kernel without such a call-back. When memory is low, the kernel has to free some pages, and to free a dirty page it has to be written back first, and only the corresponding file system knows how to do this, so some kind of a "clean page" call seems necessary?

> Obviously these aren't new issues. For the first release of the MacOS port,
> I didn't use clio at all. But now it's coming up and I'm wondering ...
> since Lustre is now being ported to other OSes, what did everyone else
> do about these issues?
>
> --Ken

Thank you,
Nikita.
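Nikita's point about dirty pages can be modelled in a few lines of C. The sketch below is a toy, not any real kernel interface: toy_page, toy_pageout_fn, toy_fs_pageout and toy_reclaim are all hypothetical names invented for illustration. It only shows why reclaim needs a hook owned by the file system: a clean page can simply be dropped, but a dirty one has to be handed to the code that knows how to write it back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this models the idea, not a kernel API. */
struct toy_page {
    bool dirty;     /* contents differ from what is on stable storage */
    bool resident;  /* page currently occupies memory */
    int  writes;    /* how many times the owner wrote this page back */
};

/* The "clean page" hook: only the owning file system can write a page back. */
typedef int (*toy_pageout_fn)(struct toy_page *pg);

/* Stand-in file-system hook: pretend the write succeeded. */
static int toy_fs_pageout(struct toy_page *pg)
{
    pg->writes++;
    pg->dirty = false;
    return 0;
}

/*
 * Try to free up to `want` resident pages. Clean pages are dropped
 * directly; dirty pages are first handed to the file system via `pageout`.
 * Returns the number of pages actually freed.
 */
static size_t toy_reclaim(struct toy_page *pages, size_t n, size_t want,
                          toy_pageout_fn pageout)
{
    size_t freed = 0;

    assert(pageout != NULL);
    for (size_t i = 0; i < n && freed < want; i++) {
        struct toy_page *pg = &pages[i];

        if (!pg->resident)
            continue;
        if (pg->dirty && pageout(pg) != 0)
            continue;   /* write-back failed: the page must stay */
        pg->resident = false;
        freed++;
    }
    return freed;
}
```

Calling toy_reclaim(pages, n, want, toy_fs_pageout) walks the array and invokes the hook only for pages that are both resident and dirty, which is the role the thread attributes to ->writepage() on Linux and to the pageout VNOP on other systems.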
Sorry to follow up on myself...

On 22 July 2010 19:43, Nikita Danilov <nikita.danilov at clusterstor.com> wrote:
> Hello,
[...]
> I think any modern UNIX version does this. BSDs and Solaris have
> PUT_PAGE() vnop. When I was porting Lustre to OS X (Panther times)
> there used to be two ways to do this: at the UBC/VFS level and at the
> UPL/Mach level, I am not sure how things work nowadays.

I just checked xnu-1504.7.4/bsd/vm/vnode_pageout(). It seems to call VNOP_PAGEOUT(), which looks quite similar to Linux ->writepage(). Is it possible for a Lustre client port to override this VNOP?

Thank you,
Nikita.
>I am not sure I understand this one. CLIO has its own radix tree to
>index cached pages precisely so that it doesn't depend on whatever
>indexing mechanism the host kernel uses (see cl_page_find0()). There
>are two implementations of a radix-tree in the public Lustre code: one
>using Linux kernel lib/radix-tree.c and another for user space in
>libcfs/posix/libcfs.h. I assume Solaris port has another one.

Weeeelll .... I can't disagree with you, since you wrote that code, but from my viewpoint there are still a number of assumptions in the CLIO layer that seem (to my eye) very Linux-specific. E.g., the use of macros like PageWriteback and PageLocked; from my extremely limited knowledge of MacOS X, equivalents to those macros simply don't exist.

If CLIO is more portable than I first thought, hey, great ... I've spent a long time trying to trace through the CLIO layers and when I find it quickly diving into Linux VM functions (sadly, things like the VM code you did for the Panther port aren't available), that worries and confuses me. Since the Linux port is really my only example, it's hard to figure out the line between what the OS needs to provide versus what I can emulate on my own. I came to the conclusion that right now CLIO was really, really tied to the Linux VM system; if that's wrong, I'll freely own up to getting it wrong! :-)

>The general assumption of CLIO data page caching is that pages (in
>files and stripe objects) are indexed by their linear logical offset
>(page index), that's all. Well, until you look at the direct-IO code
>paths too closely. :-)

Now, here's where we start running into terminology issues: when you say "linear logical offset", do you mean from the beginning of each _file_, or something else? You never deal with anything called a "page index" on MacOS X that I've found, and the whole terminology difference makes things more confusing.

>> management is handled (I'm not aware of any OS, other than Linux, that
>> provides a callback that the OS can use if they are experiencing memory
>> pressure to tell kernel modules to give up pages).
>
>I think any modern UNIX version does this. BSDs and Solaris have
>PUT_PAGE() vnop. When I was porting Lustre to OS X (Panther times)
>there used to be two ways to do this: at the UBC/VFS level and at the
>UPL/Mach level, I am not sure how things work nowadays.

I was thinking more about the shrinker interface, but on closer inspection that's not used by the CLIO layer; my apologies for saying otherwise. Of course you're right about the VNOP_PAGEOUT call (which you mention in your second email); that's the obvious choice.

--Ken
On 22 July 2010 20:26, Ken Hornstein <kenh at cmf.nrl.navy.mil> wrote:
>>I am not sure I understand this one. CLIO has its own radix tree to
>>index cached pages precisely so that it doesn't depend on whatever
>>indexing mechanism the host kernel uses (see cl_page_find0()). There
>>are two implementations of a radix-tree in the public Lustre code: one
>>using Linux kernel lib/radix-tree.c and another for user space in
>>libcfs/posix/libcfs.h. I assume Solaris port has another one.
>
> Weeeelll .... I can't disagree with you, since you wrote that code, but
> from my viewpoint there are still a number of assumptions in the CLIO
> layer that seem (to my eye) very Linux-specific. E.g., the use of
> macros like PageWriteback, PageLocked, which from my extremely limited
> knowledge of MacOS X equivalents to those macros simply don't exist.

CLIO VM code (on a particular platform) consists of two parts: generic code (in obdclass/cl_page.c) and platform-specific code. In the case of Linux this platform-specific code is in llite/vvp_page.c. The idea is that there is an interface to struct cl_page, defined in cl_object.h. vvp_page.c provides a particular implementation of this interface. The rest of CLIO interacts with the VM through this interface.

For example, PageWriteback() is used only in vvp_page.c to implement vvp_page_completion_write() (it is also used by a few assertions here and there, but all of these can be removed). The latter function is installed as cl_page_operations::cpo_completion. When platform-independent CLIO code needs to wait for IO completion it calls cl_page_completion(), which calls the corresponding function pointer in cl_page_operations.

To port a client to another platform (VM-wise), one has to implement all cl_page_operations for the platform. E.g., instead of using PageWriteback(), you would have to implement xnu_page_completion_write() via buf_biowait() or something. cl_object.h explains the intended semantics of cl_page_operations and cl_page.

> If CLIO is more portable than I first thought, hey, great ... I've
> spent a long time trying to trace through the CLIO layers and when I
> find it quickly diving into Linux VM functions (sadly, things like the
> VM code you did for the Panther port aren't available), that worries
> and confuses me. Since the Linux port is really my only example, it's

There used to be a text document explaining CLIO internals in some detail, but I cannot find it anywhere. Perhaps somebody from Oracle can help.

> hard to figure out the line between what the OS needs to provide versus
> what I can emulate on my own. I came to the conclusion that right now
> CLIO was really, really tied to the Linux VM system; if that's wrong,
> I'll freely own up to getting it wrong! :-)

I _hope_ it is wrong; it was designed so that this should be wrong, but the only way to be really sure is to try it. :-)

>>The general assumption of CLIO data page caching is that pages (in
>>files and stripe objects) are indexed by their linear logical offset
>>(page index), that's all. Well, until you look at the direct-IO code
>>paths too closely. :-)
>
> Now, here's where we start running into terminology issues: when you say
> "linear logical offset", do you mean from the beginning of each _file_,
> or something else? You never deal with anything called a "page index"
> on MacOS X that I've found, and the whole terminology difference makes
> things more confusing.

(As an aside, terminology difficulties were almost insurmountable when the Lustre client structure was discussed with Windows developers; hopefully OSX is closer to Linux than descendants of the RSX-NT lineage.)

Yes, a page index is the number a page has in a file that is divided into equally sized pages. One subtlety here is that a file (something that can be open(2)-ed) is striped over "stripe objects" and these also consist of pages. A CLIO page is "layered": it binds together a file page and the corresponding stripe object page.

Thank you,
Nikita.
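The split described above, generic code calling through a table of function pointers that each platform fills in, can be sketched in plain C. This is only an illustration of the pattern: every name below (sketch_page, sketch_page_ops, sketch_page_completion, fake_platform_completion) is made up for the example and is not the real cl_page or cl_page_operations definition; cl_object.h remains the authority on the actual interface and its semantics.

```c
#include <assert.h>
#include <stddef.h>

/* All names are hypothetical; see cl_object.h for the real interface. */
struct sketch_page;

/* Per-platform operations table, filled in once by each port. */
struct sketch_page_ops {
    /* Invoked by generic code when I/O on the page has finished. */
    void (*completion)(struct sketch_page *pg, int ioret);
};

struct sketch_page {
    const struct sketch_page_ops *ops;  /* which platform owns this page */
    int last_ioret;                     /* result of the last I/O */
    int completions;                    /* how many completions were seen */
};

/* Generic layer: knows nothing about the host VM, only about the table. */
static void sketch_page_completion(struct sketch_page *pg, int ioret)
{
    assert(pg != NULL && pg->ops != NULL && pg->ops->completion != NULL);
    pg->ops->completion(pg, ioret);
}

/*
 * Platform layer stand-in. A real port would touch host VM state here
 * (page flags on Linux, buffer or UPL state elsewhere); this one only
 * records that it was called.
 */
static void fake_platform_completion(struct sketch_page *pg, int ioret)
{
    pg->last_ioret = ioret;
    pg->completions++;
}

static const struct sketch_page_ops fake_platform_ops = {
    .completion = fake_platform_completion,
};
```

A page created with .ops = &fake_platform_ops can be passed to sketch_page_completion() without the generic function ever naming a host-specific macro, which is the property Nikita says the real design aims for.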
On Jul 22, 2010, at 21:19, Nikita Danilov wrote:
>> If CLIO is more portable than I first thought, hey, great ... I've
>> spent a long time trying to trace through the CLIO layers and when I
>> find it quickly diving into Linux VM functions (sadly, things like the
>> VM code you did for the Panther port aren't available), that worries
>> and confuses me. Since the Linux port is really my only example, it's
>
> There used to be a text document, explaining CLIO internals in some
> detail, but I cannot find it anywhere. Perhaps somebody from Oracle
> can help.

That document should exist here:
http://wiki.lustre.org/index.php/Lustre_All-Hands_Meeting_12/08
but the CLIO-TOI and CLIO documents have the same data, so it looks like a bug on the wiki.

--------------------------------------
Alexey Lyashkov
alexey.lyashkov at clusterstor.com
On 22 July 2010 22:27, Alexey Lyashkov <alexey.lyashkov at clusterstor.com> wrote:
> That document should exist here:
> http://wiki.lustre.org/index.php/Lustre_All-Hands_Meeting_12/08
> but

This is a presentation. There used to be a text document ("clio-toi" or something) with a much more detailed description. I don't think it was ever published outside of Sun.

> the CLIO-TOI and CLIO documents have the same data,
> so it looks like a bug on the wiki.

Nikita.
>Yes, page index is a number a page has in a file that is divided in
>equally sized pages. One subtlety here is that a file (something that
>can be open(2)-ed) is striped over "stripe objects" and these also
>consist of pages. CLIO page is "layered": it binds together a file
>page and corresponding stripe object page.

Okay ... so, I take it the stripe objects are the reason for the "parent" argument to cl_page_find0()? I will confess that was always a puzzler.

If anyone at Oracle could dig up that text document about CLIO that Nikita is referring to, that sure would be helpful! Although the presentation that Alexey refers to certainly is useful.

Thanks for everyone's help! This has certainly been useful.

--Ken
On 22 July 2010 23:07, Ken Hornstein <kenh at cmf.nrl.navy.mil> wrote:
>>Yes, page index is a number a page has in a file that is divided in
>>equally sized pages. One subtlety here is that a file (something that
>>can be open(2)-ed) is striped over "stripe objects" and these also
>>consist of pages. CLIO page is "layered": it binds together a file
>>page and corresponding stripe object page.
>
> Okay ... so, I take it the stripe objects are the reason for the
> "parent" argument to cl_page_find0()? I will confess that was always
> a puzzler.

Exactly.

> If anyone at Oracle could dig up that text document about CLIO that Nikita
> is referring to, that sure would be helpful! Although the presentation that
> Alexey refers to certainly is useful.
>
> Thanks for everyone's help! This has certainly been useful.

Don't hesitate to ask.

Nikita.
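The "page index" and the file-page / stripe-object-page pairing discussed above come down to a little arithmetic. The sketch below assumes plain round-robin (RAID-0 style) striping with a fixed stripe size, which is an assumption made for illustration; the helper names page_index, file_to_stripe and struct stripe_loc are hypothetical and do not come from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>

/* Where a byte of the file lives once the file is striped. */
struct stripe_loc {
    uint32_t stripe;      /* which stripe object holds the byte */
    uint64_t obj_offset;  /* byte offset inside that stripe object */
};

/* Page index: the ordinal of the equally sized page containing `offset`. */
static uint64_t page_index(uint64_t offset, uint32_t page_size)
{
    assert(page_size > 0);
    return offset / page_size;
}

/*
 * Map a file offset to a stripe object and an offset within it,
 * assuming round-robin striping (an assumption for this sketch).
 */
static struct stripe_loc file_to_stripe(uint64_t offset, uint32_t stripe_size,
                                        uint32_t stripe_count)
{
    struct stripe_loc loc;
    uint64_t chunk;

    assert(stripe_size > 0 && stripe_count > 0);
    chunk = offset / stripe_size;               /* stripe-sized chunk number */
    loc.stripe = (uint32_t)(chunk % stripe_count);
    loc.obj_offset = (chunk / stripe_count) * stripe_size
                     + offset % stripe_size;
    return loc;
}
```

With 4096-byte pages, 1 MiB stripes and four stripe objects, file offset 5 MiB + 8 KiB falls in file page 1282 and lands in stripe object 1 at object page 258. In the terms used in the thread, the layered page would bind those two pages together, with the file page acting as the parent.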
On 2010-07-22, at 12:38, Nikita Danilov wrote:
> On 22 July 2010 22:27, Alexey Lyashkov <alexey.lyashkov at clusterstor.com> wrote:
>> that document should be exist here
>> http://wiki.lustre.org/index.php/Lustre_All-Hands_Meeting_12/08
>> but CLIO-TOI and CLIO documents have same data, so looks bug on wiki.

I don't see that they are the same. CLIO-TOI is 25 pages, CLIO is 19 pages, and while they share some slides there is a fair bit of different content.

> This is a presentation. There used to be a text document ("clio-toi"
> or something) with much more detailed description. I don't think it
> was ever published outside of Sun.

Can you give me some more details on where that might be located? On the intra wiki I can only find "CLIO: Client I/O Cleanup project", and on the external wiki I can only find the above two documents.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On 23 July 2010 01:18, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> On 2010-07-22, at 12:38, Nikita Danilov wrote:
>> This is a presentation. There used to be a text document ("clio-toi"
>> or something) with much more detailed description. I don't think it
>> was ever published outside of Sun.
>
> Can you give me some more details on where that might be located? On the intra wiki I can only find "CLIO: Client I/O Cleanup project", and on the external wiki I can only find the above two documents.

I remember sending it around as email when doing the TOI on CLIO. I cannot remember whether it was added to the wiki or somewhere else, sorry. The keywords to search for would be "cl_page", "cl_object", "cl_lock", etc.

> Cheers, Andreas

Nikita.
On 2010-07-22, at 15:57, Nikita Danilov wrote:
> I remember sending it around as email when doing TOI on CLIO, I cannot
> remember whether it was added to wiki or somewhere else, sorry. The
> keywords to search for would be "cl_page", "cl_object", "cl_lock",
> etc.

Nikita, thanks for the pointers. We found the document and I've made it available on the wiki page as "CLIO-TOI-notes.pdf"

http://wiki.lustre.org/index.php/Lustre_All-Hands_Meeting_12/08

for the meeting at which it was presented along with the other CLIO presentations:

http://wiki.lustre.org/images/3/37/CLIO-TOI-notes.pdf

Ken, hopefully this is of use to you.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
>http://wiki.lustre.org/images/3/37/CLIO-TOI-notes.pdf
>
>Ken, hopefully this is of use to you.

Wow, YES! Nikita wasn't kidding that this included tons of detail. This is a huge help! Thanks for taking the time to find this document!

--Ken