Can anyone tell me if read ahead in Lustre includes "early return" features? I mean that if I read 4K and readahead decides to fetch 1M, will my request get serviced when the first 4K arrives? Is this important?

- Peter -
Peter Braam writes:
> Can anyone tell me if read ahead in Lustre includes "early return"
> features? I mean that if I read 4K and readahead decides to fetch 1M
> will my request get serviced when the first 4K arrives? Is this important?

Currently the read system call will proceed when the first RPC (including the first 4K page and some number of read-ahead pages) is serviced: generic_file_read() waits on a page lock, and the lock is released by the completion routine (ll_ap_completion()).

The problem with "early return" (if I understood it correctly) is that a large read-ahead window is an indication of sequential access, and in that case the read is going to wait on the next page anyway.

>
> - Peter -

Nikita.
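A minimal userspace sketch of the synchronization pattern Nikita describes (a reader blocking on a per-page lock that the I/O completion routine releases). generic_file_read() and ll_ap_completion() are the real routine names from the thread; everything below is a simplified pthreads model, not the actual Lustre code:

    /* Simplified model of "read returns when the first page's lock is released
     * by the I/O completion routine".  Not Lustre code; just the pattern: one
     * thread plays generic_file_read(), another plays the bulk RPC completion
     * callback (ll_ap_completion()). */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct page {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             uptodate;   /* set by the completion routine */
    };

    static struct page first_page = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
    };

    /* Plays the role of ll_ap_completion(): runs when the bulk RPC that
     * carried this page has been received. */
    static void *rpc_completion(void *arg)
    {
        sleep(1);                               /* pretend the RPC takes a while */
        pthread_mutex_lock(&first_page.lock);
        first_page.uptodate = 1;                /* "unlock" the page */
        pthread_cond_signal(&first_page.cond);
        pthread_mutex_unlock(&first_page.lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, rpc_completion, NULL);

        /* Plays the role of generic_file_read(): wait on the page "lock". */
        pthread_mutex_lock(&first_page.lock);
        while (!first_page.uptodate)
            pthread_cond_wait(&first_page.cond, &first_page.lock);
        pthread_mutex_unlock(&first_page.lock);

        printf("first 4K page arrived; read() can return now\n");
        pthread_join(t, NULL);
        return 0;
    }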
Hello!

On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
> Can anyone tell me if read ahead in Lustre includes "early return"
> features? I mean that if I read 4K and readahead decides to fetch
> 1M will my request get serviced when the first 4K arrives? Is this
> important?

I think this is impossible to implement with the current architecture. We have one bulk RPC (1M in size) that, until received completely, won't issue any callbacks. So only when that entire 1M is received would your 4k request return. On the other hand, if your example is 4k and 2M, then we will return after the 1M that contains the requested 4k is received (but there is no guarantee at the moment that we won't receive the second 1M first, I believe).

Bye,
    Oleg
This might be quite damaging in some situations - for example, if the server has the 4K data cached in RAM it should probably refuse to do a disk read, but in order to do so it would need to know that part of the request is optional, while the 4K is mandatory.

Can we give hints to the OSC about what part of I/O is requested by applications and what is requested for read-ahead? If so, could we use a more interesting IOV to do this faster?

- Peter -

Oleg Drokin wrote:
> Hello!
>
> On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
>
>> Can anyone tell me if read ahead in Lustre includes "early return"
>> features? I mean that if I read 4K and readahead decides to fetch 1M
>> will my request get serviced when the first 4K arrives? Is this
>> important?
>
> I think this is impossible to implement with the current architecture.
> We have one bulk RPC (1M in size) that, until received completely,
> won't issue any callbacks.
> So only when that entire 1M is received would your 4k request return.
> On the other hand, if your example is 4k and 2M, then we will return
> after the 1M that contains the requested 4k is received (but there is no
> guarantee at the moment that we won't receive the second 1M first, I believe).
>
> Bye,
>     Oleg
Hello!

Unfortunately, the osc currently has no idea what the original read request was. The original request size is only known in ll_file_read, which only gets the proper lock. Then we jump into generic_file_read, which calls ll_readpage for every page that needs to be read. ll_readpage has no idea how many more pages (if any) are going to be read in this request, so we just try to stuff as much as we can into the RPC (within our readahead window).

Actually, now that I look into it, there is a special readahead structure filled in that tells how big this read request is, so ll_readahead can adjust the window size for the entire read request to fit in. So it seems it is possible to see which pages are readahead and which are from the original request at the ll_readahead level, and we can pass that info down to the osc as some sort of flag if needed. But we do not (yet?) have any caching on the OST aside from the device cache, and we have no way to know what's in the device cache either.

I am not sure what you mean by a more interesting iov.

Bye,
    Oleg

On Dec 11, 2007, at 1:59 PM, Peter Braam wrote:
> This might be quite damaging in some situations - for example, if
> the server has the 4K data cached in RAM it should probably refuse to
> do a disk read, but in order to do so it would need to know that
> part of the request is optional, while the 4K is mandatory.
>
> Can we give hints to the OSC about what part of I/O is requested by
> applications and what is requested for read-ahead? If so, could we
> use a more interesting IOV to do this faster?
>
> - Peter -
>
> Oleg Drokin wrote:
>> Hello!
>>
>> On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
>>
>>> Can anyone tell me if read ahead in Lustre includes "early return"
>>> features? I mean that if I read 4K and readahead decides to fetch
>>> 1M will my request get serviced when the first 4K arrives? Is
>>> this important?
>>
>> I think this is impossible to implement with the current architecture.
>> We have one bulk RPC (1M in size) that, until received completely,
>> won't issue any callbacks.
>> So only when that entire 1M is received would your 4k request return.
>> On the other hand, if your example is 4k and 2M, then we will return
>> after the 1M that contains the requested 4k is received (but there is no
>> guarantee at the moment that we won't receive the second 1M first, I believe).
>>
>> Bye,
>>     Oleg
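A toy sketch of the flag Oleg mentions, i.e. tagging each page in a bulk RPC as either part of the application's original read or readahead. The names page_desc and build_rpc_pages are invented for illustration; this is not the Lustre OSC interface:

    /* Toy model: when building one bulk RPC, mark pages that overlap the
     * application's original request as mandatory and the rest as optional
     * readahead.  All names here are made up for illustration. */
    #include <stdio.h>

    #define PAGE_SIZE 4096

    struct page_desc {
        unsigned long index;     /* page index within the file */
        int           mandatory; /* 1 = requested by the application, 0 = readahead */
    };

    /* Fill 'pages' for an RPC covering [rpc_start, rpc_start + npages) pages,
     * marking pages that overlap the original request [req_off, req_off + req_len). */
    static void build_rpc_pages(struct page_desc *pages, int npages,
                                unsigned long rpc_start,
                                unsigned long req_off, unsigned long req_len)
    {
        unsigned long req_first = req_off / PAGE_SIZE;
        unsigned long req_last  = (req_off + req_len - 1) / PAGE_SIZE;

        for (int i = 0; i < npages; i++) {
            pages[i].index = rpc_start + i;
            pages[i].mandatory = (pages[i].index >= req_first &&
                                  pages[i].index <= req_last);
        }
    }

    int main(void)
    {
        struct page_desc pages[256];   /* one 1M RPC = 256 x 4K pages */

        /* Application read 4K at offset 0; readahead filled the rest of the 1M. */
        build_rpc_pages(pages, 256, 0, 0, 4096);
        printf("page 0 mandatory=%d, page 1 mandatory=%d\n",
               pages[0].mandatory, pages[1].mandatory);
        return 0;
    }

With such a flag the OST could, in principle, serve the mandatory pages unconditionally and drop the optional ones when servicing them would require extra disk reads.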
On Dec 11, 2007 21:42 +0300, Nikita Danilov wrote:
> Peter Braam writes:
> > Can anyone tell me if read ahead in Lustre includes "early return"
> > features? I mean that if I read 4K and readahead decides to fetch 1M
> > will my request get serviced when the first 4K arrives? Is this important?
>
> Currently the read system call will proceed when the first RPC (including
> the first 4K page and some number of read-ahead pages) is serviced:
> generic_file_read() waits on a page lock, and the lock is released by
> the completion routine (ll_ap_completion()).

Another thing worth mentioning here is that if this is the FIRST 4kB read from the file, then only that 4kB will be returned in the RPC, because readahead hasn't done linear vs. random IO detection yet. If it is the second read (and linear) then the client will get the _rest_ of the 1MB and will have to wait for that second RPC to complete. For subsequent reads the readahead will of course prefetch the pages.

For random reads the code does understand the difference between e.g. reads of 16 sequential pages (64kB generally) read at non-consecutive offsets and 16 sequential 4kB page reads. The former will NOT start readahead, while the latter does.

Two areas where our readahead is lacking are:
- strided reads (may turn the above 16 x 4kB reads into a situation where
  the client will prefetch pages instead of "random" IO, depending on access
  pattern, and will avoid prefetch of data the client is not expecting to use)
- limiting the readahead to the rate that the client is actually consuming it
  (currently once we detect sequential reads the readahead window grows
  eventually to the maximum even if this is far more than what the client needs)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
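A toy model of the behaviour Andreas describes: the first read fetches only the requested pages, a second contiguous read opens a readahead window, and further sequential reads grow it toward a maximum while a non-contiguous read resets it. The thresholds and names below are invented for illustration and are not Lustre's actual tunables:

    /* Toy sequential-vs-random detection and window growth.  Invented
     * thresholds; only the shape of the logic matches the description. */
    #include <stdio.h>

    #define MAX_WINDOW_PAGES 256   /* e.g. 1MB of 4K pages */

    struct ra_state {
        unsigned long next_expected;  /* page a sequential reader would hit next */
        unsigned long window;         /* pages to prefetch, 0 = none */
    };

    static unsigned long readahead_pages(struct ra_state *ra,
                                         unsigned long start, unsigned long npages)
    {
        if (start != ra->next_expected) {
            /* Non-contiguous read: looks random, reset and prefetch nothing. */
            ra->window = 0;
        } else if (ra->window == 0) {
            /* Second contiguous read: open a window. */
            ra->window = npages * 4;
        } else if (ra->window < MAX_WINDOW_PAGES) {
            /* Keep growing toward the maximum on further sequential reads. */
            ra->window *= 2;
            if (ra->window > MAX_WINDOW_PAGES)
                ra->window = MAX_WINDOW_PAGES;
        }
        ra->next_expected = start + npages;
        return ra->window;
    }

    int main(void)
    {
        struct ra_state ra = { (unsigned long)-1, 0 };
        /* Three sequential 4K (one-page) reads at offsets 0, 4K, 8K:
         * the first prefetches nothing, later ones open and grow the window. */
        for (unsigned long pg = 0; pg < 3; pg++)
            printf("read page %lu -> prefetch %lu pages\n",
                   pg, readahead_pages(&ra, pg, 1));
        return 0;
    }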
I thought it was the 3rd read that triggers readahead. When I track network and Lustre I/O while doing 4, 8 and 12K random reads, look at the network traffic:

#<-----------Network----------><-------Lustre Client------>
#netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite
      0      9      1      15      1      4      0       0
     18    149     16     165      0      0      0       0
     57    626     56     639      0      0      0       0
     34    382     34     392      2      8      0       0
     12     30      5      40      0      0      0       0
      0     10      1      23      0      0      0       0
      0      8      1      24      0      0      0       0
      1     20      1      23      3     12      0       0
   1087    758     32     422      0      0      0       0

Since this is a shared network, you're seeing 'noise' on the link on the order of about 10-50KB/sec, but the spike of over 1MB is clearly due to the readahead. I had also done earlier byte-level tests in which reading 8192 bytes didn't trigger readahead while 8193 did.

If I do a 12K and then a 16K random read and add readahead stats to the output, look at the following: you can again see 1MB of network traffic associated with the 12KB random read, but now we also see 3 Lustre cache misses, since the readahead occurs on the 3rd page and nothing is in the cache yet.

#<-----------Network----------><-------------Lustre Client-------------->
#netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
      0      8      1      16      3     12      0       0     0      3
   1086    757     31     408      0      0      0       0     0      0
      0      7      1      22      0      0      0       0     0      0
      0      9      1      26      0      0      0       0     0      0
      0     10      2      29      0      0      0       0     0      0
      0      8      1      20      0      0      0       0     0      0
      0     10      1      21      4     16      0       0     1      3
   2159   1478     56     781      0      0      0       0     0      0

But my question is why are we seeing a 2MB readahead (network traffic) when I'm only reading 16KB? Is it that Lustre does a 1MB readahead when the 3rd page is read and another 1MB when the fourth page is read? That doesn't sound right to me. Further, looking at the hits/misses you can also see the first 3 pages are read over the network and the fourth comes out of cache because of the readahead on the 3rd. So again, where is the 2MB coming from?

If anyone is interested, these stats come from collectl, which I've mentioned in the past: http://collectl.sourceforge.net/

There is an even more detailed format for readahead stats but I don't think anything else is relevant to this particular situation:

# LUSTRE CLIENT SUMMARY: READAHEAD
# Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
      4     16      0       0    0    1      3      0      0      0      0     0      0      0      0
      0      0      0       0    0    0      0      0      0      0      0     0      0      0      0
      0      0      0       0    0    0      0      0      0      0      0     0      0      0      0

-mark

Andreas Dilger wrote:
> On Dec 11, 2007 21:42 +0300, Nikita Danilov wrote:
>> Peter Braam writes:
>> > Can anyone tell me if read ahead in Lustre includes "early return"
>> > features? I mean that if I read 4K and readahead decides to fetch 1M
>> > will my request get serviced when the first 4K arrives? Is this important?
>>
>> Currently the read system call will proceed when the first RPC (including
>> the first 4K page and some number of read-ahead pages) is serviced:
>> generic_file_read() waits on a page lock, and the lock is released by
>> the completion routine (ll_ap_completion()).
>
> Another thing worth mentioning here is that if this is the FIRST 4kB read
> from the file, then only that 4kB will be returned in the RPC, because
> readahead hasn't done linear vs. random IO detection yet. If it is the
> second read (and linear) then the client will get the _rest_ of the 1MB
> and will have to wait for that second RPC to complete. For subsequent
> reads the readahead will of course prefetch the pages.
>
> For random reads the code does understand the difference between e.g.
> reads of 16 sequential pages (64kB generally) read at non-consecutive
> offsets and 16 sequential 4kB page reads. The former will NOT start
> readahead, while the latter does.
>
> Two areas where our readahead is lacking are:
> - strided reads (may turn the above 16 x 4kB reads into a situation
>   where the client will prefetch pages instead of "random" IO, depending
>   on access pattern, and will avoid prefetch of data the client is not
>   expecting to use)
> - limiting the readahead to the rate that the client is actually consuming
>   it (currently once we detect sequential reads the readahead window grows
>   eventually to the maximum even if this is far more than what the client
>   needs)
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Andreas Dilger writes:

[...]

> Two areas where our readahead is lacking are:
> - strided reads (may turn the above 16 x 4kB reads into a situation
>   where the client will prefetch pages instead of "random" IO, depending
>   on access pattern, and will avoid prefetch of data the client is not
>   expecting to use)
> - limiting the readahead to the rate that the client is actually consuming
>   it (currently once we detect sequential reads the readahead window grows
>   eventually to the maximum even if this is far more than what the client
>   needs)

I wonder how useful inter-file read-ahead could be. For example, starting an executable almost always incurs a sequence of reads of the shared libraries, compilation re-reads header files in the same order over and over again, etc.

>
> Cheers, Andreas

Nikita.
In message <20071211234708.GT3214 at webber.adilger.int>, Andreas Dilger writes:
> For random reads the code does understand the difference between e.g.
> reads of 16 sequential pages (64kB generally) read at non-consecutive
> offsets and 16 sequential 4kB page reads. The former will NOT start
> readahead, while the latter does.

what about direct i/o? it looks like doing direct i/o to a file will never trigger readahead. is this intentional?
Hello!

On Dec 12, 2007, at 10:52 AM, chas williams - CONTRACTOR wrote:
> In message <20071211234708.GT3214 at webber.adilger.int>, Andreas Dilger
> writes:
>> For random reads the code does understand the difference between e.g.
>> reads of 16 sequential pages (64kB generally) read at non-consecutive
>> offsets and 16 sequential 4kB page reads. The former will NOT start
>> readahead, while the latter does.
> what about direct i/o? it looks like doing direct i/o to a file
> will never trigger readahead. is this intentional?

Yes it is. DIRECT IO by its nature is "direct", i.e. it goes straight into application buffers; we are not to put any more data there than the application said it can accept. And we are not allowed to cache any of that data (or to use any cache to get the data during reads) either.

Bye,
    Oleg
In message <6C95C14E-9F4E-4E3E-A435-5010C2EFC999 at sun.com>, Oleg Drokin writes:
> Yes it is.
> DIRECT IO by its nature is "direct", i.e. it goes straight into
> application buffers; we are not to put any more data there than the
> application said it can accept.

i agree with this interpretation of O_DIRECT.

> And we are not allowed to cache any of that data (or to use any cache
> to get the data during reads) either.

this seems a bit stricter than i would expect. while i would expect O_DIRECT to bypass the kernel readahead mechanism, i think that lustre's readahead is outside of the scope of O_DIRECT.
Hello!

On Dec 13, 2007, at 10:15 AM, chas williams - CONTRACTOR wrote:
>> Yes it is.
>> DIRECT IO by its nature is "direct", i.e. it goes straight into
>> application buffers; we are not to put any more data there than the
>> application said it can accept.
> i agree with this interpretation of O_DIRECT.
>
>> And we are not allowed to cache any of that data (or to use any cache
>> to get the data during reads) either.
> this seems a bit stricter than i would expect. while i would expect
> O_DIRECT to bypass the kernel readahead mechanism, i think that
> lustre's readahead is outside of the scope of O_DIRECT.

While technically nothing prevents us from reading some more stuff into the client cache with readahead after a genuine directio request (we cannot do any readahead into application buffers for obvious reasons), we cannot use the cache for directio requests. So if the next read request comes in directio mode, we MUST bypass the cache, and the whole purpose of readahead is therefore defeated.

Bye,
    Oleg
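For reference, a minimal example of the kind of read being discussed: an O_DIRECT read goes straight into an aligned user buffer and bypasses the page cache, which is why cached readahead data cannot be used to satisfy it. The 4096-byte alignment below is a typical requirement, not anything Lustre-specific:

    /* Minimal O_DIRECT read.  Alignment requirements vary by filesystem and
     * device; 4096 is used here as a common value. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/testfile";
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* Buffer address, length, and file offset usually all need alignment. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            close(fd);
            return 1;
        }

        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes directly, no page cache involved\n", n);

        free(buf);
        close(fd);
        return 0;
    }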
Tom.Wang wrote:
> Hi, Peter
>
> I just talked with Matt about his pCIFS test. In his test, there are 2
> clients and 2 OSTs (stripe_count = 2, stripe_size = 1M).
> Each client runs 1 thread to read a shared file. Each thread (client)
> will only read data of the shared file from 1 OST, so the
> read request is discontiguous (1M-step stride I/O) for each thread,
> which will not trigger read-ahead, because the
> read-ahead window will be reset once it meets a discontiguous read
> request. So keeping read-ahead at the same step as the stride step
> ((stripe_count - 1) * stripe_size) instead of contiguous could just
> fix this problem, and that is what we are working on now.

Finally! This makes a lot of sense.

> Btw: we also need this in the lustre collective read ADIO driver,
> since we will also need to reorganize the data as a 1 client -> 1 OST
> model like pCIFS does.

I now see why this stripe behavior with read-ahead is so very important for Lustre in some cases. It would also be important to know that in basic cases we really get as good read performance as write.

- Peter -

> Another alternative would be to move the read-ahead mechanism to the
> osc layer, but that is maybe a bad idea.
> Thanks
> WangDi
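To make the access pattern concrete, here is a small sketch of plain RAID-0-style striping arithmetic as used in the example above; this is the generic formula for stripe_count/stripe_size layouts, not code from the Lustre tree. A client that only reads the stripes stored on one OST issues reads separated by a gap of (stripe_count - 1) * stripe_size:

    /* Generic RAID-0 striping arithmetic: which OST holds a given file offset,
     * and what stride a "one client per OST" reader produces. */
    #include <stdio.h>

    #define STRIPE_SIZE  (1 << 20)   /* 1M */
    #define STRIPE_COUNT 2

    static int ost_of_offset(unsigned long long off)
    {
        return (off / STRIPE_SIZE) % STRIPE_COUNT;
    }

    int main(void)
    {
        /* Client 0 reads only the pieces that live on OST 0. */
        printf("client 0 reads file offsets:");
        for (unsigned long long off = 0; off < 8ULL * STRIPE_SIZE; off += STRIPE_SIZE)
            if (ost_of_offset(off) == 0)
                printf(" [%lluM-%lluM)", off >> 20, (off + STRIPE_SIZE) >> 20);
        printf("\n");

        /* The gap between consecutive reads is (stripe_count - 1) * stripe_size. */
        printf("gap between reads: %dM\n",
               ((STRIPE_COUNT - 1) * STRIPE_SIZE) >> 20);
        return 0;
    }

With stripe_count = 2 and stripe_size = 1M this prints reads at [0M-1M), [2M-3M), [4M-5M), ... with a 1M gap, which is exactly the discontiguous pattern that resets the current read-ahead window.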
On Dec 12, 2007 08:53 +0300, Nikita Danilov wrote:
> Andreas Dilger writes:
> > Two areas where our readahead is lacking are:
> > - strided reads (may turn the above 16 x 4kB reads into a situation
> >   where the client will prefetch pages instead of "random" IO, depending
> >   on access pattern, and will avoid prefetch of data the client is not
> >   expecting to use)
> > - limiting the readahead to the rate that the client is actually consuming
> >   it (currently once we detect sequential reads the readahead window grows
> >   eventually to the maximum even if this is far more than what the client
> >   needs)
>
> I wonder how useful inter-file read-ahead could be. For example, starting
> an executable almost always incurs a sequence of reads of the shared
> libraries, compilation re-reads header files in the same order over and
> over again, etc.

Well, we already have the beginning of this kind of operation on the client with client-side metadata statahead. That detects the readdir->stat pattern and prefetches the MDS attribute data. The next step would be OST statahead, which could be started asynchronously as soon as the LOV EA is returned from the MDS, instead of waiting for the userspace process to get to that entry and force the OST stat. OST statahead will not be needed on 1.8 in many cases when size-on-MDS is available (if the file is closed), but it would still be useful for 1.6 and for the case of "impatient user running 'ls -l' in the job output directory while files are being written".

The logical extension would be to detect readdir + read (for e.g. updatedb, find ... | xargs grep, etc.) type loads and prefetch the file data if it is not too big, or at least just the first block for "file" or similar.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
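For context, the access pattern statahead is trying to hide is the standard readdir-then-stat loop that "ls -l" and similar tools issue; on an uncached Lustre client each stat() costs a synchronous MDS round trip plus OST getattrs for the size. This is a plain POSIX illustration, nothing Lustre-specific:

    /* The readdir -> stat pattern that metadata statahead detects: without
     * prefetching, every stat() is a synchronous round trip. */
    #include <dirent.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(dir);
        if (!d) { perror("opendir"); return 1; }

        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
            char path[4096];
            struct stat st;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (stat(path, &st) == 0)        /* one MDS (+OST) round trip each */
                printf("%10lld  %s\n", (long long)st.st_size, de->d_name);
        }
        closedir(d);
        return 0;
    }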
On Thu, 2007-12-13 at 17:21 -0700, Andreas Dilger wrote:
> On Dec 12, 2007 08:53 +0300, Nikita Danilov wrote:
> > Andreas Dilger writes:
> > > Two areas where our readahead is lacking are:
> > > - strided reads (may turn the above 16 x 4kB reads into a situation
> > >   where the client will prefetch pages instead of "random" IO, depending
> > >   on access pattern, and will avoid prefetch of data the client is not
> > >   expecting to use)
> > > - limiting the readahead to the rate that the client is actually consuming
> > >   it (currently once we detect sequential reads the readahead window grows
> > >   eventually to the maximum even if this is far more than what the client
> > >   needs)
> >
> > I wonder how useful inter-file read-ahead could be. For example, starting
> > an executable almost always incurs a sequence of reads of the shared
> > libraries, compilation re-reads header files in the same order over and
> > over again, etc.
>
> Well, we already have the beginning of this kind of operation on the client
> with client-side metadata statahead. That detects the readdir->stat pattern
> and prefetches the MDS attribute data.

I'm not sure this was a good idea. stat readahead is slow on a single-CPU box, and slow in testing with local devices (very low network latency) - it can add a noticeable speedup only with high-latency network links, because it can send many stat requests while the main thread blocks waiting for one answer.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
On Dec 20, 2007 12:20 +0200, Alex Lyashkov wrote:
> On Thu, 2007-12-13 at 17:21 -0700, Andreas Dilger wrote:
> > Well, we already have the beginning of this kind of operation on the client
> > with client-side metadata statahead. That detects the readdir->stat pattern
> > and prefetches the MDS attribute data.
>
> I'm not sure this was a good idea. stat readahead is slow on a single-CPU
> box, and slow in testing with local devices (very low network latency) -
> it can add a noticeable speedup only with high-latency network links,
> because it can send many stat requests while the main thread blocks
> waiting for one answer.

This isn't the common configuration that we run in, however. Most Lustre setups today (and even laptops) have multiple CPUs, and clients are generally remote instead of local. In such testing the statahead is about 2x faster than without, and I think it could be even faster if we also had OST statahead. The current statahead is limited by the fact that we are only hiding the client->MDS latency (1 of 2 synchronous RPCs) and not the client->OSTs latency (the OST getattrs happen concurrently for N stripes, but take as long as the slowest of them). So our maximum "ls -l" improvement is currently 2x until SOM (for closed files) or OST statahead is implemented.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.