Hello,

as a project for college I'm doing a behavioral comparison between Lustre and CXFS when dealing with simple strided files using POSIX semantics. In one of the tests, each participating process reads 16 chunks of 32 MB each from a common, strided file using the following code:

------------------------------------------------------------------------------------------
int myfile = open("thefile", O_RDONLY);

MPI_Barrier(MPI_COMM_WORLD);  // the barriers are only there to help measure time

off_t distance = (numtasks-1)*p.buffersize;
off_t offset   = rank*p.buffersize;

int j;
lseek(myfile, offset, SEEK_SET);
for (j = 0; j < p.buffercount; j++) {
    read(myfile, buffers[j], p.buffersize);  // buffers are aligned to the page size
    lseek(myfile, distance, SEEK_CUR);
}

MPI_Barrier(MPI_COMM_WORLD);

close(myfile);
------------------------------------------------------------------------------------------

I'm facing the following problem: when this code is run in parallel, the read operations on certain processes take more and more time to complete. I attached a graphical trace of this for a run with only 2 processes. As you can see, the read operations on process 0 stay more or less constant at about 0.12 seconds each, while on process 1 they grow to as much as 39 seconds!

If I run the program with only one process, the time stays at ~0.12 seconds per read operation. The problem doesn't appear if the O_DIRECT flag is used.

Can somebody explain to me why this is happening? Since I'm very new to Lustre, I may be making some silly mistake, so be nice to me ;)

I'm using Lustre 1.6.5.1 on SLES 10 Patchlevel 1, kernel 2.6.16.54-0.2.5_lustre.1.6.5.1.

Thanks!

Alvaro Aguilera.

[Attachment: lustre.png -- graphical trace of the per-read times for the 2-process run]
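One incidental point about the loop above, separate from the slowdown discussed below: POSIX allows read() to return fewer bytes than requested, so benchmarks usually retry until the whole chunk has arrived. A minimal sketch of such a wrapper, reusing the file descriptor and buffer names from the post (the helper itself is hypothetical and not part of the original test):

------------------------------------------------------------------------------------------
#include <errno.h>
#include <unistd.h>

/* Read exactly `count` bytes into `buf`; return 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    while (count > 0) {
        ssize_t n = read(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal, just retry */
            return -1;               /* real I/O error */
        }
        if (n == 0)
            return -1;               /* unexpected end of file */
        p += n;
        count -= (size_t)n;
    }
    return 0;
}
------------------------------------------------------------------------------------------

With that, the call in the loop would become read_full(myfile, buffers[j], p.buffersize), so each timing covers a full chunk rather than a possibly partial read() call.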
On Thu, 2009-08-20 at 23:52 +0200, Alvaro Aguilera wrote:
> I'm facing the following problem: when this code is run in parallel,
> the read operations on certain processes take more and more time to
> complete. I attached a graphical trace of this for a run with only 2
> processes.

Just a (perhaps silly) question, but does the striping of the file (or of the directory the file is being created in) match your I/O pattern? That is, ideally, each thread/rank/process (whatever you want to call them) should be doing I/O in its own stripe.

$ man lfs

if none of this is meaningful.

b.
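For concreteness, the suggestion boils down to commands along these lines, using the option spelling of the 1.6-era lfs (file and directory names are placeholders; check man lfs on your system, since the flags have changed between releases):

$ lfs getstripe thefile                 # show stripe size, stripe count and OST placement
$ lfs setstripe -s 32m -c 1 newfile     # create a file with a 32 MB stripe size on one OST
$ lfs setstripe -s 32m -c -1 /some/dir  # default layout for files created later in that directory, striped over all OSTs

Striping is fixed when a file is created, so changing the layout of an existing file means recreating it (or copying it into a file created with the desired layout).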
Thanks for pointing that out. I was using the default striping, which in my case is a 1 MB stripe size on one OST.

However, if I change the stripe size to 32 MB (the size of the buffers being written/read), the function that writes the file using O_DIRECT stops working. Its code is very similar to the one posted above, and the problem is that the write() call gets stuck while writing the first buffer. Is there any trick to using O_DIRECT on Lustre? I've aligned the buffers using posix_memalign(), and every offset and count seems to be a multiple of the page size (4 KB).

On Fri, Aug 21, 2009 at 12:04 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> Just a (perhaps silly) question, but does the striping of the file (or
> of the directory the file is being created in) match your I/O pattern?
> That is, ideally, each thread/rank/process (whatever you want to call
> them) should be doing I/O in its own stripe.
>
> $ man lfs
>
> if none of this is meaningful.
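For reference, the usual O_DIRECT ground rules on Linux are that the buffer address, the file offset and the transfer size all have to be aligned; for Lustre clients of that generation, page-size alignment is what is asked for, which matches what is described above. Below is a minimal, self-contained sketch of that kind of direct write, meant as a sanity check rather than a fix for the hang (file name and sizes are placeholders; the file is assumed to have been created with the desired striping beforehand):

------------------------------------------------------------------------------------------
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t bufsize = 32UL * 1024 * 1024;   /* 32 MB chunk, as in the test */
    void *buf;

    /* the buffer must be aligned; page size (4 KB here) is the usual requirement */
    if (posix_memalign(&buf, 4096, bufsize) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, bufsize);

    int fd = open("thefile", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* offset and count must also be multiples of the page size */
    off_t offset = 0;                            /* rank * bufsize in the real test */
    if (pwrite(fd, buf, bufsize, offset) != (ssize_t)bufsize)
        perror("pwrite");

    close(fd);
    free(buf);
    return 0;
}
------------------------------------------------------------------------------------------

Shrinking bufsize to a single page is a quick way to tell an alignment mistake from a problem related to the 32 MB stripe size: if even a one-page direct write hangs, alignment is not the culprit.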
Hello!

Any chance you can use a more modern release like 1.8.1? A number of bugs have been fixed since 1.6.5, including some read-ahead logic problems that could impede read performance.

Bye,
    Oleg

On Aug 20, 2009, at 10:38 PM, Alvaro Aguilera wrote:
> Thanks for pointing that out. I was using the default striping, which
> in my case is a 1 MB stripe size on one OST.
>
> However, if I change the stripe size to 32 MB (the size of the buffers
> being written/read), the function that writes the file using O_DIRECT
> stops working. Its code is very similar to the one posted above, and
> the problem is that the write() call gets stuck while writing the first
> buffer. Is there any trick to using O_DIRECT on Lustre? I've aligned
> the buffers using posix_memalign(), and every offset and count seems to
> be a multiple of the page size (4 KB).
Hello,

You may want to look at bug 17197 and try applying this patch to your Lustre source:
https://bugzilla.lustre.org/attachment.cgi?id=25062
Or you can wait for 1.8.2.

Thanks
Wangdi

Alvaro Aguilera wrote:
> I'm facing the following problem: when this code is run in parallel,
> the read operations on certain processes take more and more time to
> complete. The read operations on process 0 stay more or less constant
> at about 0.12 seconds each, while on process 1 they grow to as much as
> 39 seconds!
>
> If I run the program with only one process, the time stays at ~0.12
> seconds per read operation. The problem doesn't appear if the O_DIRECT
> flag is used.
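For anyone following along later: applying an attachment like that to the Lustre source tree is the usual patch(1) routine. The file name below is only a placeholder, and the -p level depends on how the diff was generated, so do a dry run first:

$ cd lustre-1.6.5.1
$ patch -p1 --dry-run < /tmp/bug17197-attachment-25062.patch   # check that it applies cleanly
$ patch -p1 < /tmp/bug17197-attachment-25062.patch

after which the client modules have to be rebuilt and reinstalled.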
No, for the time being I'm stuck with this version...

Regards,
Alvaro.

On Fri, Aug 21, 2009 at 4:57 AM, Oleg Drokin <Oleg.Drokin at sun.com> wrote:
> Any chance you can use a more modern release like 1.8.1? A number of
> bugs have been fixed since 1.6.5, including some read-ahead logic
> problems that could impede read performance.
Thanks for the hint, but unfortunately I can't make any updates to the cluster...

Do you think both of the problems I experienced are bugs in Lustre, and are they resolved in current versions?

Thanks.
Alvaro.

On Fri, Aug 21, 2009 at 6:32 AM, di wang <di.wang at sun.com> wrote:
> You may want to look at bug 17197 and try applying this patch to your
> Lustre source: https://bugzilla.lustre.org/attachment.cgi?id=25062
> Or you can wait for 1.8.2.
Alvaro Aguilera wrote:
> Thanks for the hint, but unfortunately I can't make any updates to the
> cluster...
>
> Do you think both of the problems I experienced are bugs in Lustre, and
> are they resolved in current versions?

These should be Lustre bugs. Do the 2 processes run on different nodes or on the same node?

Thanks
WangDi
They run on different physical nodes and access the OST via 4x InfiniBand.

On Fri, Aug 21, 2009 at 3:15 PM, di wang <di.wang at sun.com> wrote:
> These should be Lustre bugs. Do the 2 processes run on different nodes
> or on the same node?
Hello,

Alvaro Aguilera wrote:
> They run on different physical nodes and access the OST via 4x InfiniBand.

I have never heard of such problems when the processes run on different nodes. Client memory, perhaps? Can you post the read-ahead stats (before and after the test) here, obtained with

lctl get_param llite.*.read_ahead_stats

But there are indeed a lot of fixes for strided reads since 1.6.5; they are included in the patch I pointed to earlier and can probably fix your problem.

Thanks
WangDi
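A convenient way to collect exactly those numbers is to snapshot the statistics around the run on the client node and diff them (the test binary name is only a placeholder):

$ lctl get_param llite.*.read_ahead_stats > ra_before.txt
$ mpirun -np 2 ./strided_read_test
$ lctl get_param llite.*.read_ahead_stats > ra_after.txt
$ diff -u ra_before.txt ra_after.txt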
Hi,

here is the requested information.

Before the test:

llite.fastfs-ffff810102a6a400.read_ahead_stats
snapshot_time:                  1251851453.382275 (secs.usecs)
pending issued pages:           0
hits                            7301235
misses                          10546
readpage not consecutive        14369
miss inside window              1
failed grab_cache_page          6285314
failed lock match               0
read but discarded              98955
zero length file                0
zero size window                3495
read-ahead to EOF               172
hit max r-a issue               783042
wrong page from grab_cache_page 0

After:

llite.fastfs-ffff810102a6a400.read_ahead_stats
snapshot_time:                  1251851620.183964 (secs.usecs)
pending issued pages:           0
hits                            7506005
misses                          330064
readpage not consecutive        14432
miss inside window              319450
failed grab_cache_page          6322954
failed lock match               17294
read but discarded              98955
zero length file                0
zero size window                3495
read-ahead to EOF               192
hit max r-a issue               837908
wrong page from grab_cache_page 0

There seem to be a lot of misses, as well as a locking problem, don't you think? Btw., in this test 4 processes read 512 MB each from a 2 GB file.

Regards,
Alvaro.

On Fri, Aug 21, 2009 at 3:38 PM, di wang <di.wang at sun.com> wrote:
> I have never heard of such problems when the processes run on different
> nodes. Client memory, perhaps? Can you post the read-ahead stats (before
> and after the test) here, obtained with
>
> lctl get_param llite.*.read_ahead_stats
Hello,

During the test, miss_inside_window grew by about 319,000 while hits grew by only about 205,000, i.e. roughly 3 to 2, which is indeed far too high. It probably means that a lot of pages are read in by read-ahead but evicted again before they are actually accessed. So the patch in bug 17197 should fix this problem; it will be included in 1.8.2.

Thanks
WangDi

Alvaro Aguilera wrote:
> hits                            7506005
> misses                          330064
> miss inside window              319450
> failed lock match               17294
>
> There seem to be a lot of misses, as well as a locking problem, don't
> you think? Btw., in this test 4 processes read 512 MB each from a 2 GB
> file.
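For a client that cannot be patched or upgraded right away, one stop-gap that is sometimes tried when read-ahead thrashes on strided access is to shrink or disable the client read-ahead window and let the 32 MB application reads do the work themselves. A sketch, assuming the llite read-ahead tunable shipped with 1.6 (verify the parameter name under /proc/fs/lustre/llite/ on your client before relying on it):

$ lctl get_param llite.*.max_read_ahead_mb          # note the current window size in MB
$ lctl set_param llite.*.max_read_ahead_mb=0        # 0 disables client read-ahead
  (run the strided test)
$ lctl set_param llite.*.max_read_ahead_mb=<old>    # restore the value noted above

Since the problem also disappears with O_DIRECT, which bypasses the page cache and read-ahead entirely, this probes the same hypothesis without changing the application.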