Hello,

We have discussed the implementation of the new readahead in CLIO. Here I am just sending the design document out to ask for comments.

We have already had several inputs from Z (aka bzzz) and other engineers. I'm not going to copy those ideas here directly, because I'm afraid of distorting something, so please reply to this email with your ideas - thanks in advance.

Jay

-- 
Good good study, day day up
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ra.tar
Type: application/x-tar
Size: 194560 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-devel/attachments/20100120/d07ce14d/attachment-0001.tar
UP!!

Here is a PDF version of the design document. I am also attaching a picture, because the pictures in the PDF are not clear.

Thanks,
Jay

jay wrote:
> Hello,
>
> We have discussed the implementation of new readahead in CLIO. Here I
> just send the design document out to ask for comments.
>
> We have already had several inputs from Z (aka bzzz) and other
> engineers. I'm not going to copy those ideas here directly, because
> I'm afraid to distort something, so please reply to this email to show
> your ideas - thanks in advance.
>
> Jay
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

-- 
Good good study, day day up
-------------- next part --------------
A non-text attachment was scrubbed...
Name: readahead.pdf
Type: application/pdf
Size: 198695 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-devel/attachments/20100122/9c461a12/attachment-0001.pdf
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lazy-readahead-1.jpg
Type: image/jpeg
Size: 142721 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-devel/attachments/20100122/9c461a12/attachment-0001.jpg
Alexey Lyashkov
2010-Jan-23 07:09 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
>> We have an idea to spawn a per file readahead thread for each process,
>> and this thread can be used to issue the readahead RPC async.

Do I understand correctly: you suggest spawning one new thread per open file? So if a client has 10 processes, and each process has 100 files open, you need to spawn 1000 new threads?

On Fri, 2010-01-22 at 18:53 +0800, jay wrote:
> UP!!
>
> Here is a pdf version of design document. Also I'm attaching a picture
> because the pictures in pdf is not clear.
>
> Thanks,
> Jay

-- 
Alexey Lyashkov <alexey.lyashkov at clusterstor.com>
ClusterStor
Alexey Lyashkov wrote:
>> We have an idea to spawn a per file readahead thread for each process,
>> and this thread can be used to issue the readahead RPC async.
>
> I correctly understand: you suggest a spawn one new thread per open
> file?
> so if client have 10 processes, and each process is open 100 files, you
> need spawn 1000 new threads?

No - per-process readahead, or some system-wide readahead thread pool. This works because most of those threads are sleeping, and it takes little time to issue the readahead requests. The idea behind the scheme is to issue the readahead RPCs asynchronously.

BTW, I'm not going to implement what you mentioned on Linux, because I don't think it is a good idea, as I said in the design doc. However, we HAVE to have an async thread pool to implement readahead for Windows. Windows doesn't have an interface for issuing async read requests, and it lacks a mechanism for page locks or anything similar - what a pity!

Jay

-- 
Good good study, day day up
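A minimal userspace sketch of the system-wide readahead thread pool described above, using pthreads. All names here (ra_pool, ra_submit, issue_rpc) are invented for illustration and are not the real llite code; issue_rpc() is a stub standing in for building and firing an OST_READ RPC:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct ra_request { long offset, length; struct ra_request *next; };

struct ra_pool {
    pthread_mutex_t    lock;
    pthread_cond_t     nonempty;
    struct ra_request *head, *tail;
    int                shutdown;
    atomic_long        issued;     /* readahead RPCs issued so far */
};

static void issue_rpc(struct ra_pool *p, struct ra_request *r)
{
    /* A real client would build an OST_READ RPC here and send it;
     * the reader that submitted 'r' has already returned to userspace. */
    atomic_fetch_add(&p->issued, 1);
    free(r);
}

static void *ra_worker(void *arg)
{
    struct ra_pool *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (!p->head && !p->shutdown)
            pthread_cond_wait(&p->nonempty, &p->lock);
        struct ra_request *r = p->head;
        if (!r) {                      /* shutdown and queue drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        p->head = r->next;
        if (!p->head)
            p->tail = NULL;
        pthread_mutex_unlock(&p->lock);
        issue_rpc(p, r);
    }
}

static void ra_submit(struct ra_pool *p, long off, long len)
{
    struct ra_request *r = malloc(sizeof(*r));
    r->offset = off;
    r->length = len;
    r->next   = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->tail)
        p->tail->next = r;
    else
        p->head = r;
    p->tail = r;
    pthread_cond_signal(&p->nonempty);
    pthread_mutex_unlock(&p->lock);    /* submitter never blocks on the RPC */
}

/* Demo: two workers drain four requests; returns the number of RPCs issued. */
static long ra_demo(void)
{
    struct ra_pool p = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                         NULL, NULL, 0, 0 };
    pthread_t tid[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, ra_worker, &p);
    for (int i = 0; i < 4; i++)
        ra_submit(&p, i * 1048576L, 1048576L);
    pthread_mutex_lock(&p.lock);
    p.shutdown = 1;
    pthread_cond_broadcast(&p.nonempty);
    pthread_mutex_unlock(&p.lock);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&p.issued);
}
```

The point of the sketch is only the submit path: ra_submit() queues and returns immediately, so the reading process never waits for the RPC itself.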
Alexey Lyashkov
2010-Jan-24 09:18 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
On Sun, 2010-01-24 at 09:01 +0800, jay wrote:
> Alexey Lyashkov wrote:
>>> We have an idea to spawn a per file readahead thread for each process,
>>> and this thread can be used to issue the readahead RPC async.
>>
>> I correctly understand: you suggest a spawn one new thread per open
>> file?
>> so if client have 10 processes, and each process is open 100 files, you
>> need spawn 1000 new threads?
>
> No, per process readahead, or some system readahead thread pool, this is
> because most of those threads are sleeping, and it consumes little time
> to issue readahead requests. The idea behind the scheme is to issue
> readahead rpcs async.

The first case is the same as what I said (I think) - 10 processes reading from their own files, so 1000 new threads will be spawned. In the second case, you will lose readahead requests on a heavily loaded client.

> BTW, I'm not going to implement what you mentioned in linux, because I
> don't think this is a good idea, as what I said in design doc. However,
> we HAVE to have an async thread pool to implement readahead for windows.
> Windows doesn't have an interface of issuing async read request, lack of
> a mechanism to have page lock or similar things - what a pity!

Hmm... it looks like I don't understand the problem. Currently the Linux client uses ->readpage() to generate an OST_READ RPC and sends it via ptlrpcd-io. Why not generate this RPC directly for Windows? Or do you mean asynchronously updating the VM cache?

-- 
Alexey Lyashkov <alexey.lyashkov at clusterstor.com>
ClusterStor
Nicolas Williams
2010-Jan-25 04:05 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
On Sun, Jan 24, 2010 at 09:01:46AM +0800, jay wrote:
> Alexey Lyashkov wrote:
>> I correctly understand: you suggest a spawn one new thread per open
>> file?
>> so if client have 10 processes, and each process is open 100 files, you
>> need spawn 1000 new threads?
>
> No, per process readahead, or some system readahead thread pool, this is
> because most of those threads are sleeping, and it consumes little time
> to issue readahead requests. The idea behind the scheme is to issue
> readahead rpcs async.

Sleeping threads do consume memory resources, and context switches between them do add cache pressure. The readahead work should all be async, in which case you need no more readahead threads than you have CPUs.

> BTW, I'm not going to implement what you mentioned in linux, because I
> don't think this is a good idea, as what I said in design doc. However,
> we HAVE to have an async thread pool to implement readahead for windows.
> Windows doesn't have an interface of issuing async read request, lack of
> a mechanism to have page lock or similar things - what a pity!

But surely you can still do the readaheads asynchronously. Say you think that block N of some file will be needed soon: so you issue the read ahead of time. You'll need to place the data somewhere, and hopefully that will be somewhere that the host OS's VFS sub-system (Windows in your case) can either provide or accept -- if not, you'll need to do a copy later, but you're still able to send the read request, and process the reply, asynchronously.

Nico
--
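The staging-buffer idea above - prefetch block N into a private buffer, copy it out later if the host VM can't accept the pages directly - can be sketched like this. All names (ra_cache, ra_prefetch, ra_copy_out) are invented for illustration; ra_prefetch() stands in for the async read completion:

```c
#include <string.h>
#include <stdbool.h>

#define RA_SLOTS 8
#define RA_BLKSZ 4096

struct ra_slot  { long block; bool valid; char data[RA_BLKSZ]; };
struct ra_cache { struct ra_slot slot[RA_SLOTS]; };

/* Stand-in for the async read completion: stage block N's data in a
 * private buffer, since the host VFS can't accept our pages directly. */
static void ra_prefetch(struct ra_cache *c, long block, const char *src)
{
    struct ra_slot *s = &c->slot[block % RA_SLOTS];
    s->block = block;
    memcpy(s->data, src, RA_BLKSZ);
    s->valid = true;
}

/* Called from the read path: copy staged data out if we have it.
 * Returns true on a readahead hit. */
static bool ra_copy_out(struct ra_cache *c, long block, char *dst)
{
    struct ra_slot *s = &c->slot[block % RA_SLOTS];
    if (!s->valid || s->block != block)
        return false;              /* miss: caller reads synchronously */
    memcpy(dst, s->data, RA_BLKSZ);
    s->valid = false;              /* slot is single-use */
    return true;
}
```

The extra memcpy is exactly the copy Nico mentions as the price of not being able to hand pages to the host VM; the read itself and its completion still run fully asynchronously.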
Nico and shadow,

Since you have the same question about Windows, I am replying to both of you in one email. I have also got Matt involved - he is a Windows expert.

Alexey Lyashkov wrote:
> first case is same as i say (i think) - 10 processes reading from own
> files, so will be spawn 1000 new threads.
> in second case you will be lost readahead requests on hardloaded client.

Nod - that's why I'm not going to do it on Linux, as the design doc said - didn't you see section 8.1? :-)

> hm.. looks i don't understand problem. Currently linux client is using
> ->readpage() to generate OST_READ RPC and sending via ptlrpcd-io.
> Why isn't generate this RPC directly for Windows? Or you mean about
> update asynchronous update VM cache ?

The problem is that we would have to wait for the RPC (which may contain only readahead pages) to finish before we could return to user space. You may ask why we can't do that; the answer is that we should pipeline the readahead requests, instead of reading one large chunk of data to implement readahead. The problem with Windows is that it lacks interfaces to manipulate pages. I'm not a Windows expert - please ask Matt if you have Windows-specific questions.

-- 
Good good study, day day up
Matt Wu
2010-Jan-25 06:55 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
We need to do readahead asynchronously, but the Windows kernel doesn't give us an easy solution. Here are the issues for Windows readahead:

1, The Windows kernel (VM) doesn't provide kernel drivers with an equivalent of grab_cache_page_nowait_gfp() to allocate an empty/invalid page. So in ll_readpage() it's too late for WNC to grab more pages for readahead.
2, The routines provided by the Windows kernel to allocate page cache are synchronous: they won't return until the requested pages are fetched.

So we plan to start a thread pool and dispatch the readahead requests to these threads instead of blocking the user thread.

We can group the threads in several ways:

1, Request per random thread, without any specific order: we just start a fixed number of threads and queue each readahead request to any thread of the pool. This is the decision we made during the WNC readahead meeting last week.
2, Thread per file (file), or thread per open instance (fd).
3, Thread per OST: we need to divide each readahead request into several requests which are stripe-boundary aligned.

regards,
matt

On 2010/1/25 12:05, Nicolas Williams wrote:
> Sleeping threads do consume memory resources, and context switches
> between them do add cache pressure. The read ahead work should all be
> async, in which case you need no more readahead threads than you have
> CPUs.
>
> But surely you can still do the readaheads asynchronously. Say you
> think that block N of some file will be needed soon: so you issue the
> read ahead of time. You'll need to place the data somewhere, and
> hopefully that will be somewhere that the host OS's VFS sub-system
> (Windows in your case) can either provide or accept -- if not you'll
> need to do a copy later, but you're still able to send the read request,
> and process the reply, asynchronously.
>
> Nico
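The stripe-boundary split that option 3 needs can be sketched as a pure function over a simple RAID-0-style layout (stripe_size bytes per OST, round-robin over stripe_count OSTs). The names are illustrative, not the real llite/lov API:

```c
#include <stddef.h>

struct ra_chunk { long offset, length; int ost; };

/* Split the byte range [off, off+len) into chunks that never cross a
 * stripe boundary, tagging each chunk with the OST it maps to under a
 * round-robin RAID-0 layout. Fills 'out' (capacity 'max'); returns the
 * number of chunks produced. */
static size_t ra_split_by_stripe(long off, long len,
                                 long stripe_size, int stripe_count,
                                 struct ra_chunk *out, size_t max)
{
    size_t n = 0;
    long end = off + len;

    while (off < end && n < max) {
        /* End of the stripe containing 'off'. */
        long stripe_end = (off / stripe_size + 1) * stripe_size;
        long chunk_end  = stripe_end < end ? stripe_end : end;

        out[n].offset = off;
        out[n].length = chunk_end - off;
        out[n].ost    = (int)((off / stripe_size) % stripe_count);
        n++;
        off = chunk_end;
    }
    return n;
}
```

For example, with a 1 MB stripe size over 2 OSTs, a 2 MB readahead starting at offset 512K splits into three chunks (512K to OST 0, 1 MB to OST 1, 512K to OST 0), each of which can then be queued to its per-OST thread.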
Andreas Dilger
2010-Jan-25 07:23 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
On 2010-01-24, at 23:55, Matt Wu wrote:
> We can group the threads by several ways:
> 1, request per random thread, without any specify order. we just start a
> fixed number of threads and queue the readahead request to any thread of
> the thread pool.
> this is the decision we made during WNC readahead meeting last week.
> 2, thread per file (file) or thread per open instance (fd)
> 3, thread per ost, we need divide the readahead request to several which
> are stripe boundary aligned.

In order to keep the readahead pages local to the NUMA node that the userspace thread is running on, I'd recommend at most a single readahead thread per core. That way, when a readahead thread allocates pages, they will be on the right NUMA node.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Nicolas Williams
2010-Jan-25 15:34 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
On Mon, Jan 25, 2010 at 12:23:03AM -0700, Andreas Dilger wrote:
> On 2010-01-24, at 23:55, Matt Wu wrote:
>> We can group the threads by several ways:
>> 1, request per random thread, without any specify order. we just
>> start a fixed number of threads and queue the readahead request to
>> any thread of the thread pool. this is the decision we made during
>> WNC readahead meeting last week.
>> 2, thread per file (file) or thread per open instance (fd)
>> 3, thread per ost, we need divide the readahead request to several
>> which are stripe boundary aligned.
>
> In order to keep the readahead pages local to the NUMA node that the
> userspace thread is running on, I'd recommend at most a single
> readahead thread per core. That way, when the readahead thread is
> allocating pages they will be on the right NUMA node.

That was my recommendation as well, but if I understand Matt correctly, the Windows VFS makes it impossible to do readahead asynchronously, which is why Matt suggests having many threads. I have no clue as to the relevant Windows kernel APIs, but if Matt's right about Windows, then color me surprised.

Assuming that's correct, and that there's no reasonable way around the problem, then I'd recommend having a pool with some number of threads (say, 3 * CPUs), with readaheads done only when there are threads available in the pool.

Nico
--
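The "only when there are threads available" admission policy above - never block the reader, just drop the readahead when the pool is busy - reduces to a lock-free idle-worker counter. A sketch with invented names (readahead is only a hint, so a dropped request costs nothing worse than a later synchronous read):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ra_limiter { atomic_int idle; };   /* free workers, e.g. 3 * ncpus */

/* Accept the readahead only if a worker is free; never block the reader. */
static bool ra_try_acquire(struct ra_limiter *l)
{
    int n = atomic_load(&l->idle);
    while (n > 0) {
        /* On failure the CAS reloads 'n', so we retry with the fresh value. */
        if (atomic_compare_exchange_weak(&l->idle, &n, n - 1))
            return true;           /* a worker slot is ours */
    }
    return false;                  /* pool busy: drop this readahead */
}

/* Called by a worker when its readahead RPC has been issued. */
static void ra_release(struct ra_limiter *l)
{
    atomic_fetch_add(&l->idle, 1);
}
```

A reader would call ra_try_acquire() before queueing a request; on false it simply skips the readahead, which keeps a heavily loaded client from piling up stale requests.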
Alex Zhuravlev
2010-Jan-26 10:02 UTC
[Lustre-devel] proposal on implementing a new readahead in clio
Hi,

I think we could help a lot if you restructured the proposal a bit: first of all, describe the algorithm without implementation details, probably using the notion of events: a read event extending the window, I/O to get data ahead, readahead I/O completion, hit/miss, etc. Then map these events to specific code paths, and explain what kind of information or mechanism the layers are missing to implement the algorithm.

z.

On 1/20/10 3:37 PM, jay wrote:
> Hello,
>
> We have discussed the implementation of new readahead in CLIO. Here I
> just send the design document out to ask for comments.
>
> We have already had several inputs from Z(aka bzzz) and other engineers.
> I'm not going to copy those ideas here directly, because I'm afraid to
> distort something, so please reply this email to show your ideas -
> thanks in advance.
>
> Jay