Can anyone tell me if read ahead in Lustre includes "early return" features? I mean that if I read 4K and readahead decides to fetch 1M, will my request get serviced when the first 4K arrives? Is this important?

- Peter -
Peter Braam writes:
> Can anyone tell me if read ahead in Lustre includes "early return"
> features? I mean that if I read 4K and readahead decides to fetch 1M
> will my request get serviced when the first 4K arrives? Is this important?

Currently the read system call will proceed when the first RPC (including the first 4K page and some number of read-ahead pages) is serviced: generic_file_read() waits on a page lock, and the lock is released by the completion routine (ll_ap_completion()).

The problem with "early return" (if I understood it correctly) is that a large read-ahead window is an indication of sequential access, and in that case the read is going to wait on the next page anyway.

>
> - Peter -

Nikita.
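A minimal userspace sketch of the synchronization pattern Nikita describes (a reader blocking on a per-page lock that the I/O completion routine releases). generic_file_read() and ll_ap_completion() are the real routine names from the thread; everything below is a simplified pthreads model, not the actual Lustre code:

    /* Simplified model of "read returns when the first page's lock is released
     * by the I/O completion routine".  Not Lustre code; just the pattern: one
     * thread plays generic_file_read(), another plays the bulk RPC completion
     * callback (ll_ap_completion()). */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct page {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             uptodate;   /* set by the completion routine */
    };

    static struct page first_page = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
    };

    /* Plays the role of ll_ap_completion(): runs when the bulk RPC that
     * carried this page has been received. */
    static void *rpc_completion(void *arg)
    {
        sleep(1);                               /* pretend the RPC takes a while */
        pthread_mutex_lock(&first_page.lock);
        first_page.uptodate = 1;                /* "unlock" the page */
        pthread_cond_signal(&first_page.cond);
        pthread_mutex_unlock(&first_page.lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, rpc_completion, NULL);

        /* Plays the role of generic_file_read(): wait on the page "lock". */
        pthread_mutex_lock(&first_page.lock);
        while (!first_page.uptodate)
            pthread_cond_wait(&first_page.cond, &first_page.lock);
        pthread_mutex_unlock(&first_page.lock);

        printf("first 4K page arrived; read() can return now\n");
        pthread_join(t, NULL);
        return 0;
    }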
Hello!

On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
> Can anyone tell me if read ahead in Lustre includes "early return"
> features? I mean that if I read 4K and readahead decides to fetch
> 1M will my request get serviced when the first 4K arrives? Is this
> important?

I think this is impossible to implement with the current architecture. We have one bulk RPC (1M in size) that, until received completely, won't issue any callbacks. So only when that entire 1M is received would your 4k request return. On the other hand, if your example is 4k and 2M, then we will return after the 1M that contains the requested 4k is received (but there is no guarantee at the moment that we won't receive the second 1M first, I believe).

Bye,
    Oleg
This might be quite damaging in some situations - for example, if the server has the 4K data cached in RAM it should probably refuse to do a disk read, but in order to do so it would need to know that part of the request is optional, while the 4K is mandatory.

Can we give hints to the OSC about what part of I/O is requested by applications and what is requested for read-ahead? If so, could we use a more interesting IOV to do this faster?

- Peter -

Oleg Drokin wrote:
> Hello!
>
> On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
>
>> Can anyone tell me if read ahead in Lustre includes "early return"
>> features? I mean that if I read 4K and readahead decides to fetch 1M
>> will my request get serviced when the first 4K arrives? Is this
>> important?
>
> I think this is impossible to implement with the current architecture.
> We have one bulk RPC (1M in size) that, until received completely,
> won't issue any callbacks.
> So only when that entire 1M is received would your 4k request return.
> On the other hand, if your example is 4k and 2M, then we will return
> after the 1M that contains the requested 4k is received (but there is no
> guarantee at the moment that we won't receive the second 1M first, I believe).
>
> Bye,
>     Oleg
Hello!

Unfortunately, the osc currently has no idea what the original read request was. The original request size is only known in ll_file_read, which only gets the proper lock. Then we jump into generic_file_read, which calls ll_readpage for every page that needs to be read. ll_readpage has no idea how many more pages (if any) are going to be read in this request, so we just try to stuff as much as we can into the RPC (within our readahead window).

Actually, now that I look into it, there is a special readahead structure filled in that tells how big this read request is, so ll_readahead can adjust the window size for the entire read request to fit in. So it seems it is possible to see which pages are readahead and which are from the original request at the ll_readahead level, and we can pass that info down to the osc as some sort of flag if needed. But we do not (yet?) have any caching on the OST aside from the device cache, and we have no way to know what's in the device cache either.

I am not sure what you mean by a more interesting iov.

Bye,
    Oleg

On Dec 11, 2007, at 1:59 PM, Peter Braam wrote:
> This might be quite damaging in some situations - for example, if
> the server has the 4K data cached in RAM it should probably refuse to
> do a disk read, but in order to do so it would need to know that
> part of the request is optional, while the 4K is mandatory.
>
> Can we give hints to the OSC about what part of I/O is requested by
> applications and what is requested for read-ahead? If so, could we
> use a more interesting IOV to do this faster?
>
> - Peter -
>
> Oleg Drokin wrote:
>> Hello!
>>
>> On Dec 11, 2007, at 1:25 PM, Peter Braam wrote:
>>
>>> Can anyone tell me if read ahead in Lustre includes "early return"
>>> features? I mean that if I read 4K and readahead decides to fetch
>>> 1M will my request get serviced when the first 4K arrives? Is
>>> this important?
>>
>> I think this is impossible to implement with the current architecture.
>> We have one bulk RPC (1M in size) that, until received completely,
>> won't issue any callbacks.
>> So only when that entire 1M is received would your 4k request return.
>> On the other hand, if your example is 4k and 2M, then we will return
>> after the 1M that contains the requested 4k is received (but there is no
>> guarantee at the moment that we won't receive the second 1M first, I believe).
>>
>> Bye,
>>     Oleg
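A toy sketch of the flag Oleg mentions, i.e. tagging each page in a bulk RPC as either part of the application's original read or readahead. The names page_desc and build_rpc_pages are invented for illustration; this is not the Lustre OSC interface:

    /* Toy model: when building one bulk RPC, mark pages that overlap the
     * application's original request as mandatory and the rest as optional
     * readahead.  All names here are made up for illustration. */
    #include <stdio.h>

    #define PAGE_SIZE 4096

    struct page_desc {
        unsigned long index;     /* page index within the file */
        int           mandatory; /* 1 = requested by the application, 0 = readahead */
    };

    /* Fill 'pages' for an RPC covering [rpc_start, rpc_start + npages) pages,
     * marking pages that overlap the original request [req_off, req_off + req_len). */
    static void build_rpc_pages(struct page_desc *pages, int npages,
                                unsigned long rpc_start,
                                unsigned long req_off, unsigned long req_len)
    {
        unsigned long req_first = req_off / PAGE_SIZE;
        unsigned long req_last  = (req_off + req_len - 1) / PAGE_SIZE;

        for (int i = 0; i < npages; i++) {
            pages[i].index = rpc_start + i;
            pages[i].mandatory = (pages[i].index >= req_first &&
                                  pages[i].index <= req_last);
        }
    }

    int main(void)
    {
        struct page_desc pages[256];   /* one 1M RPC = 256 x 4K pages */

        /* Application read 4K at offset 0; readahead filled the rest of the 1M. */
        build_rpc_pages(pages, 256, 0, 0, 4096);
        printf("page 0 mandatory=%d, page 1 mandatory=%d\n",
               pages[0].mandatory, pages[1].mandatory);
        return 0;
    }

With such a flag the OST could, in principle, serve the mandatory pages unconditionally and drop the optional ones when servicing them would require extra disk reads.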
On Dec 11, 2007 21:42 +0300, Nikita Danilov wrote:
> Peter Braam writes:
> > Can anyone tell me if read ahead in Lustre includes "early return"
> > features? I mean that if I read 4K and readahead decides to fetch 1M
> > will my request get serviced when the first 4K arrives? Is this important?
>
> Currently the read system call will proceed when the first RPC (including
> the first 4K page and some number of read-ahead pages) is serviced:
> generic_file_read() waits on a page lock, and the lock is released by
> the completion routine (ll_ap_completion()).

Another thing worth mentioning here is that if this is the FIRST 4kB read from the file, then only that 4kB will be returned in the RPC, because readahead hasn't done linear vs. random IO detection yet. If it is the second read (and linear) then the client will get the _rest_ of the 1MB and will have to wait for that second RPC to complete. For subsequent reads the readahead will of course prefetch the pages.

For random reads the code does understand the difference between e.g. reads of 16 sequential pages (64kB generally) read at non-consecutive offsets and 16 sequential 4kB page reads. The former will NOT start readahead, while the latter does.

Two areas where our readahead is lacking are:
- strided reads (may turn the above 16 x 4kB reads into a situation where
  the client will prefetch pages instead of "random" IO, depending on access
  pattern, and will avoid prefetch of data the client is not expecting to use)
- limiting the readahead to the rate that the client is actually consuming it
  (currently once we detect sequential reads the readahead window grows
  eventually to the maximum even if this is far more than what the client needs)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
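A toy model of the behaviour Andreas describes: the first read fetches only the requested pages, a second contiguous read opens a readahead window, and further sequential reads grow it toward a maximum while a non-contiguous read resets it. The thresholds and names below are invented for illustration and are not Lustre's actual tunables:

    /* Toy sequential-vs-random detection and window growth.  Invented
     * thresholds; only the shape of the logic matches the description. */
    #include <stdio.h>

    #define MAX_WINDOW_PAGES 256   /* e.g. 1MB of 4K pages */

    struct ra_state {
        unsigned long next_expected;  /* page a sequential reader would hit next */
        unsigned long window;         /* pages to prefetch, 0 = none */
    };

    static unsigned long readahead_pages(struct ra_state *ra,
                                         unsigned long start, unsigned long npages)
    {
        if (start != ra->next_expected) {
            /* Non-contiguous read: looks random, reset and prefetch nothing. */
            ra->window = 0;
        } else if (ra->window == 0) {
            /* Second contiguous read: open a window. */
            ra->window = npages * 4;
        } else if (ra->window < MAX_WINDOW_PAGES) {
            /* Keep growing toward the maximum on further sequential reads. */
            ra->window *= 2;
            if (ra->window > MAX_WINDOW_PAGES)
                ra->window = MAX_WINDOW_PAGES;
        }
        ra->next_expected = start + npages;
        return ra->window;
    }

    int main(void)
    {
        struct ra_state ra = { (unsigned long)-1, 0 };
        /* Three sequential 4K (one-page) reads at offsets 0, 4K, 8K:
         * the first prefetches nothing, later ones open and grow the window. */
        for (unsigned long pg = 0; pg < 3; pg++)
            printf("read page %lu -> prefetch %lu pages\n",
                   pg, readahead_pages(&ra, pg, 1));
        return 0;
    }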
I thought it was the 3rd read that triggers readahead. When I track network and Lustre I/O while doing 4, 8 and 12K random reads, look at the network traffic:

#<-----------Network----------><-------Lustre Client------>
#netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite
      0      9      1      15      1      4      0       0
     18    149     16     165      0      0      0       0
     57    626     56     639      0      0      0       0
     34    382     34     392      2      8      0       0
     12     30      5      40      0      0      0       0
      0     10      1      23      0      0      0       0
      0      8      1      24      0      0      0       0
      1     20      1      23      3     12      0       0
   1087    758     32     422      0      0      0       0

Since this is a shared network, you're seeing 'noise' on the link on the order of about 10-50KB/sec, but the spike of over 1MB is clearly due to the readahead. I had also done earlier byte-level tests in which reading 8192 bytes didn't trigger readahead while 8193 did.

If I do a 12K and then a 16K random read and add readahead stats to the output, look at the following: you can again see 1MB of network traffic associated with the 12KB random read, but now we also see 3 Lustre cache misses, since the readahead occurs on the 3rd page and nothing is in the cache yet.

#<-----------Network----------><-------------Lustre Client-------------->
#netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
      0      8      1      16      3     12      0       0     0      3
   1086    757     31     408      0      0      0       0     0      0
      0      7      1      22      0      0      0       0     0      0
      0      9      1      26      0      0      0       0     0      0
      0     10      2      29      0      0      0       0     0      0
      0      8      1      20      0      0      0       0     0      0
      0     10      1      21      4     16      0       0     1      3
   2159   1478     56     781      0      0      0       0     0      0

But my question is why are we seeing a 2MB readahead (network traffic) when I'm only reading 16KB? Is it that Lustre does a 1MB readahead when the 3rd page is read and another 1MB when the fourth page is read? That doesn't sound right to me. Further, looking at the hits/misses you can also see the first 3 pages are read over the network and the fourth comes out of cache because of the readahead on the 3rd. So again, where is the 2MB coming from?

If anyone is interested, these stats come from collectl, which I've mentioned in the past: http://collectl.sourceforge.net/

There is an even more detailed format for readahead stats but I don't think anything else is relevant to this particular situation:

# LUSTRE CLIENT SUMMARY: READAHEAD
# Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
      4     16      0       0    0    1      3      0      0      0      0     0      0      0      0
      0      0      0       0    0    0      0      0      0      0      0     0      0      0      0
      0      0      0       0    0    0      0      0      0      0      0     0      0      0      0

-mark

Andreas Dilger wrote:
> On Dec 11, 2007 21:42 +0300, Nikita Danilov wrote:
>> Peter Braam writes:
>> > Can anyone tell me if read ahead in Lustre includes "early return"
>> > features? I mean that if I read 4K and readahead decides to fetch 1M
>> > will my request get serviced when the first 4K arrives? Is this important?
>>
>> Currently the read system call will proceed when the first RPC (including
>> the first 4K page and some number of read-ahead pages) is serviced:
>> generic_file_read() waits on a page lock, and the lock is released by
>> the completion routine (ll_ap_completion()).
>
> Another thing worth mentioning here is that if this is the FIRST 4kB read
> from the file, then only that 4kB will be returned in the RPC, because
> readahead hasn't done linear vs. random IO detection yet. If it is the
> second read (and linear) then the client will get the _rest_ of the 1MB
> and will have to wait for that second RPC to complete. For subsequent
> reads the readahead will of course prefetch the pages.
>
> For random reads the code does understand the difference between e.g.
> reads of 16 sequential pages (64kB generally) read at non-consecutive
> offsets and 16 sequential 4kB page reads. The former will NOT start
> readahead, while the latter does.
>
> Two areas where our readahead is lacking are:
> - strided reads (may turn the above 16 x 4kB reads into a situation
>   where the client will prefetch pages instead of "random" IO, depending
>   on access pattern, and will avoid prefetch of data the client is not
>   expecting to use)
> - limiting the readahead to the rate that the client is actually consuming
>   it (currently once we detect sequential reads the readahead window grows
>   eventually to the maximum even if this is far more than what the client
>   needs)
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Andreas Dilger writes:

[...]

> Two areas where our readahead is lacking are:
> - strided reads (may turn the above 16 x 4kB reads into a situation
>   where the client will prefetch pages instead of "random" IO, depending
>   on access pattern, and will avoid prefetch of data the client is not
>   expecting to use)
> - limiting the readahead to the rate that the client is actually consuming
>   it (currently once we detect sequential reads the readahead window grows
>   eventually to the maximum even if this is far more than what the client
>   needs)

I wonder how useful inter-file read-ahead could be. For example, starting an executable almost always incurs a sequence of reads of the shared libraries, compilation re-reads header files in the same order over and over again, etc.

>
> Cheers, Andreas

Nikita.
In message <20071211234708.GT3214 at webber.adilger.int>, Andreas Dilger writes:
> For random reads the code does understand the difference between e.g.
> reads of 16 sequential pages (64kB generally) read at non-consecutive
> offsets and 16 sequential 4kB page reads. The former will NOT start
> readahead, while the latter does.

what about direct i/o? it looks like doing direct i/o to a file will never trigger readahead. is this intentional?
Hello!

On Dec 12, 2007, at 10:52 AM, chas williams - CONTRACTOR wrote:
> In message <20071211234708.GT3214 at webber.adilger.int>, Andreas Dilger
> writes:
>> For random reads the code does understand the difference between e.g.
>> reads of 16 sequential pages (64kB generally) read at non-consecutive
>> offsets and 16 sequential 4kB page reads. The former will NOT start
>> readahead, while the latter does.
> what about direct i/o? it looks like doing direct i/o to a file
> will never trigger readahead. is this intentional?

Yes it is. DIRECT IO by its nature is "direct", i.e. it goes straight into application buffers; we are not to put any more data there than the application said it can accept. And we are not allowed to cache any of that data (or to use any cache to get the data during reads) either.

Bye,
    Oleg
In message <6C95C14E-9F4E-4E3E-A435-5010C2EFC999 at sun.com>, Oleg Drokin writes:
> Yes it is.
> DIRECT IO by its nature is "direct", i.e. it goes straight into
> application buffers; we are not to put any more data there than the
> application said it can accept.

i agree with this interpretation of O_DIRECT.

> And we are not allowed to cache any of that data (or to use any cache
> to get the data during reads) either.

this seems a bit stricter than i would expect. while i would expect O_DIRECT to bypass the kernel readahead mechanism, i think that lustre's readahead is outside of the scope of O_DIRECT.
Hello!

On Dec 13, 2007, at 10:15 AM, chas williams - CONTRACTOR wrote:
>> Yes it is.
>> DIRECT IO by its nature is "direct", i.e. it goes straight into
>> application buffers; we are not to put any more data there than the
>> application said it can accept.
> i agree with this interpretation of O_DIRECT.
>
>> And we are not allowed to cache any of that data (or to use any cache
>> to get the data during reads) either.
> this seems a bit stricter than i would expect. while i would expect
> O_DIRECT to bypass the kernel readahead mechanism, i think that
> lustre's readahead is outside of the scope of O_DIRECT.

While technically nothing prevents us from reading some more stuff into the client cache with readahead after a genuine directio request (we cannot do any readahead into application buffers for obvious reasons), we cannot use the cache for directio requests. So if the next read request comes in directio mode, we MUST bypass the cache, and the whole purpose of readahead is therefore defeated.

Bye,
    Oleg
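For reference, a minimal example of the kind of read being discussed: an O_DIRECT read goes straight into an aligned user buffer and bypasses the page cache, which is why cached readahead data cannot be used to satisfy it. The 4096-byte alignment below is a typical requirement, not anything Lustre-specific:

    /* Minimal O_DIRECT read.  Alignment requirements vary by filesystem and
     * device; 4096 is used here as a common value. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/testfile";
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* Buffer address, length, and file offset usually all need alignment. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            close(fd);
            return 1;
        }

        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes directly, no page cache involved\n", n);

        free(buf);
        close(fd);
        return 0;
    }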
Tom.Wang wrote:
> Hi, Peter
>
> I just talked with Matt about his pCIFS test. In his test, there are 2
> clients and 2 OSTs (stripe_count = 2, stripe_size = 1M).
> Each client runs 1 thread to read a shared file. Each thread (client)
> will only read data of the shared file from 1 OST, so the
> read request is discontiguous (1M-step stride I/O) for each thread,
> which will not trigger read-ahead, because the
> read-ahead window will be reset once it meets a discontiguous read
> request. So keeping read-ahead at the same step as the stride step
> ((stripe_count - 1) * stripe_size) instead of contiguous could just
> fix this problem, and that is what we are working on now.

Finally! This makes a lot of sense.

> Btw: we also need this in the lustre collective read ADIO driver,
> since we will also need to reorganize the data as a 1 client -> 1 OST
> model like pCIFS does.

I now see why this stripe behavior with read-ahead is so very important for Lustre in some cases. It would also be important to know that in basic cases we really get as good read performance as write.

- Peter -

> Another alternative would be to move the read-ahead mechanism to the
> osc layer, but that is maybe a bad idea.
> Thanks
> WangDi
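To make the access pattern concrete, here is a small sketch of plain RAID-0-style striping arithmetic as used in the example above; this is the generic formula for stripe_count/stripe_size layouts, not code from the Lustre tree. A client that only reads the stripes stored on one OST issues reads separated by a gap of (stripe_count - 1) * stripe_size:

    /* Generic RAID-0 striping arithmetic: which OST holds a given file offset,
     * and what stride a "one client per OST" reader produces. */
    #include <stdio.h>

    #define STRIPE_SIZE  (1 << 20)   /* 1M */
    #define STRIPE_COUNT 2

    static int ost_of_offset(unsigned long long off)
    {
        return (off / STRIPE_SIZE) % STRIPE_COUNT;
    }

    int main(void)
    {
        /* Client 0 reads only the pieces that live on OST 0. */
        printf("client 0 reads file offsets:");
        for (unsigned long long off = 0; off < 8ULL * STRIPE_SIZE; off += STRIPE_SIZE)
            if (ost_of_offset(off) == 0)
                printf(" [%lluM-%lluM)", off >> 20, (off + STRIPE_SIZE) >> 20);
        printf("\n");

        /* The gap between consecutive reads is (stripe_count - 1) * stripe_size. */
        printf("gap between reads: %dM\n",
               ((STRIPE_COUNT - 1) * STRIPE_SIZE) >> 20);
        return 0;
    }

With stripe_count = 2 and stripe_size = 1M this prints reads at [0M-1M), [2M-3M), [4M-5M), ... with a 1M gap, which is exactly the discontiguous pattern that resets the current read-ahead window.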
On Dec 12, 2007 08:53 +0300, Nikita Danilov wrote:
> Andreas Dilger writes:
> > Two areas where our readahead is lacking are:
> > - strided reads (may turn the above 16 x 4kB reads into a situation
> >   where the client will prefetch pages instead of "random" IO, depending
> >   on access pattern, and will avoid prefetch of data the client is not
> >   expecting to use)
> > - limiting the readahead to the rate that the client is actually consuming
> >   it (currently once we detect sequential reads the readahead window grows
> >   eventually to the maximum even if this is far more than what the client
> >   needs)
>
> I wonder how useful inter-file read-ahead could be. For example, starting
> an executable almost always incurs a sequence of reads of the shared
> libraries, compilation re-reads header files in the same order over and
> over again, etc.

Well, we already have the beginning of this kind of operation on the client with client-side metadata statahead. That detects the readdir->stat pattern and prefetches the MDS attribute data. The next step would be OST statahead, which could be started asynchronously as soon as the LOV EA is returned from the MDS, instead of waiting for the userspace process to get to that entry and force the OST stat. OST statahead will not be needed on 1.8 in many cases when size-on-MDS is available (if the file is closed), but it would still be useful for 1.6 and for the case of "impatient user running 'ls -l' in the job output directory while files are being written".

The logical extension would be to detect readdir + read (for e.g. updatedb, find ... | xargs grep, etc.) type loads and prefetch the file data if it is not too big, or at least just the first block for "file" or similar.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
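For context, the access pattern statahead is trying to hide is the standard readdir-then-stat loop that "ls -l" and similar tools issue; on an uncached Lustre client each stat() costs a synchronous MDS round trip plus OST getattrs for the size. This is a plain POSIX illustration, nothing Lustre-specific:

    /* The readdir -> stat pattern that metadata statahead detects: without
     * prefetching, every stat() is a synchronous round trip. */
    #include <dirent.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(dir);
        if (!d) { perror("opendir"); return 1; }

        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
            char path[4096];
            struct stat st;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (stat(path, &st) == 0)        /* one MDS (+OST) round trip each */
                printf("%10lld  %s\n", (long long)st.st_size, de->d_name);
        }
        closedir(d);
        return 0;
    }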
On Thu, 2007-12-13 at 17:21 -0700, Andreas Dilger wrote:
> On Dec 12, 2007 08:53 +0300, Nikita Danilov wrote:
> > Andreas Dilger writes:
> > > Two areas where our readahead is lacking are:
> > > - strided reads (may turn the above 16 x 4kB reads into a situation
> > >   where the client will prefetch pages instead of "random" IO, depending
> > >   on access pattern, and will avoid prefetch of data the client is not
> > >   expecting to use)
> > > - limiting the readahead to the rate that the client is actually consuming
> > >   it (currently once we detect sequential reads the readahead window grows
> > >   eventually to the maximum even if this is far more than what the client
> > >   needs)
> >
> > I wonder how useful inter-file read-ahead could be. For example, starting
> > an executable almost always incurs a sequence of reads of the shared
> > libraries, compilation re-reads header files in the same order over and
> > over again, etc.
>
> Well, we already have the beginning of this kind of operation on the client
> with client-side metadata statahead. That detects the readdir->stat pattern
> and prefetches the MDS attribute data.

I'm not sure this was a good idea. stat readahead is slow on a single-CPU box, and slow in testing with local devices (very low network latency) - it can add a noticeable speedup only with high-latency network links, because it can send many stat requests while the main thread blocks waiting for one answer.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
On Dec 20, 2007 12:20 +0200, Alex Lyashkov wrote:
> On Thu, 2007-12-13 at 17:21 -0700, Andreas Dilger wrote:
> > Well, we already have the beginning of this kind of operation on the client
> > with client-side metadata statahead. That detects the readdir->stat pattern
> > and prefetches the MDS attribute data.
>
> I'm not sure this was a good idea. stat readahead is slow on a single-CPU
> box, and slow in testing with local devices (very low network latency) -
> it can add a noticeable speedup only with high-latency network links,
> because it can send many stat requests while the main thread blocks
> waiting for one answer.

This isn't the common configuration that we run in, however. Most Lustre setups today (and even laptops) have multiple CPUs, and clients are generally remote instead of local. In such testing the statahead is about 2x faster than without, and I think it could be even faster if we also had OST statahead. The current statahead is limited by the fact that we are only hiding the client->MDS latency (1 of 2 synchronous RPCs) and not the client->OSTs latency (the OST getattrs happen concurrently for N stripes, but take as long as the slowest of them). So our maximum "ls -l" improvement is currently 2x until SOM (for closed files) or OST statahead is implemented.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.