I am having an issue with our Lustre file system. In our current environment on a SAN file system, opening a large file and doing fseeks completes in under 2 seconds. Running that same routine on our Lustre file system, the routine actually never finishes. Are there any tunable parameters in Lustre that can alleviate this problem?

Thanks in advance.
On 2010-04-07, at 14:09, Ronald K Long wrote:
> I am having an issue with our Lustre file system. In our current
> environment on a SAN file system, opening a large file and doing
> fseeks completes in under 2 seconds. Running that same routine on
> our Lustre file system, the routine actually never finishes.

Doing fseek() itself is only a client-side operation, so it should have no performance impact, UNLESS you are doing SEEK_END, which requires that the actual file size be computed on the client. That causes lock revocation from all of the clients and is an expensive operation. Using SEEK_CUR or SEEK_SET has no cost at all.

> Are there any tunable parameters in Lustre that can alleviate this
> problem?

It depends on what the problem really is.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
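For illustration, here is a minimal sketch of the distinction at the system-call level (using lseek() rather than stdio's fseek(), to keep library buffering out of the picture); the file name and offsets are made up. Only the SEEK_END case needs the real end-of-file, which is what triggers the lock revocation described above:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Client-local: these only set the file position, no server traffic. */
        lseek(fd, 2097152, SEEK_SET);
        lseek(fd, 4096, SEEK_CUR);

        /* Requires the current file size, so the Lustre client may have to
         * revoke locks held by other clients - the expensive case. */
        lseek(fd, 0, SEEK_END);

        close(fd);
        return 0;
    }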
We are doing SEEK_SET:

    fseek(fp, offset[i], SEEK_SET);

We were running into this same issue on our SAN file system until we set the dma_cache_read_ahead to match our buffer size of 256k. Just wondering if there is a way to set that within Lustre. We are running 1.8 on the MDS and OSS, and the clients running the fseek are running 1.6.

Thanks again.

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> Doing fseek() itself is only a client-side operation, so it should
> have no performance impact, UNLESS you are doing SEEK_END, which
> requires that the actual file size be computed on the client. [...]
> Using SEEK_CUR or SEEK_SET has no cost at all.
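As a side note on readahead hints: at the application level, the portable way to tell the kernel that a file will be accessed randomly (so aggressive readahead is unlikely to help) is posix_fadvise(); whether the Lustre 1.6/1.8 client actually honors this hint is not established in this thread, so treat the following as a sketch only. The file name is illustrative:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Hint that accesses to the whole file (offset 0, length 0) will
         * be in random order. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        /* ... seeks and reads ... */

        close(fd);
        return 0;
    }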
On 2010-04-13, at 07:59, Ronald K Long wrote:
> We are doing SEEK_SET:
>
>     fseek(fp, offset[i], SEEK_SET);
>
> We were running into this same issue on our SAN file system until we
> set the dma_cache_read_ahead to match our buffer size of 256k. Just
> wondering if there is a way to set that within Lustre. We are
> running 1.8 on the MDS and OSS, and the clients running the fseek
> are running 1.6.

Sorry, I didn't read enough into your question. When you said "opening a large file and doing fseek()" I thought that was the only thing you were doing, but really you are doing IOs after the fseek(), and that is presumably what is taking a long time.

It's true that if you are doing random reads it may cause sub-optimal performance, because the client cannot do readahead to mitigate the network/disk latency.

One problem that we saw some time ago at another customer was that their application doing random IO was getting the "IO blocksize" from the file via {f,}stat() and reading from the file in chunks of st_blksize. Since Lustre returns st_blksize = 2MB, the application wanting 4kB chunks of random data from the file was seeking and reading 2MB of extra data for each seek. It would be worthwhile to strace your application to see if it is doing the same thing.

> Andreas Dilger <andreas.dilger at oracle.com> wrote:
>> Doing fseek() itself is only a client-side operation, so it should
>> have no performance impact, UNLESS you are doing SEEK_END [...].
>> Using SEEK_CUR or SEEK_SET has no cost at all.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
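A quick way to see the block size being described is to print st_blksize for a file; glibc's stdio sizes its stream buffer from this value, which is why buffered reads on Lustre can turn into 2MB transfers. A minimal sketch (pass the path of the file your application reads):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        /* On a Lustre 1.6/1.8 client this typically prints 2097152 (2MB);
         * on a local disk it is usually 4096. */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        return 0;
    }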
Andreas - Here is a snippet of the strace output:

    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0

Thanks again

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> One problem that we saw some time ago at another customer was that
> their application doing random IO was getting the "IO blocksize" from
> the file via {f,}stat() and reading from the file in chunks of
> st_blksize. Since Lustre returns st_blksize = 2MB, the application
> wanting 4kB chunks of random data from the file was seeking and
> reading 2MB of extra data for each seek. It would be worthwhile to
> strace your application to see if it is doing the same thing.
On Wed, 2010-04-14 at 07:08 -0500, Ronald K Long wrote:
> Andreas - Here is a snippet of the strace output.
>
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152

As Andreas suspected, your application is doing 2MB reads every time. Does it really need 2MB of data on each read? If not, can you fix your application to only read as much data as it actually wants?

b.
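For what "only read as much data as it actually wants" could look like in practice, here is a minimal sketch using pread() to fetch just a small record at a given offset, bypassing stdio buffering entirely; the file name, offset, and record size are illustrative:

    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];                       /* only the bytes we need */
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Read 4kB at offset 2097152 without moving the file position
         * and without any library-level buffer fill. */
        ssize_t n = pread(fd, buf, sizeof(buf), 2097152);
        if (n < 0)
            perror("pread");

        close(fd);
        return 0;
    }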
We've narrowed down the problem quite a bit.

The problematic code snippet is not actually doing any reads or writes; it's just doing a massive number of fseek() operations within a couple of nested loops. (Note: the production code is doing some I/O, but this snippet was narrowed down to the bare minimum example that exhibited the problem, which is how we discovered that fseek was the culprit.)

The issue appears to be the behavior of the glibc implementation of fseek(). Apparently, a call to fseek() on a buffered file stream causes glibc to flush the stream, regardless of whether a flush is actually needed. If we modify the snippet to call setvbuf() and disable buffering on the file stream before any of the fseek() calls, then it finishes more or less instantly, as you would expect.

The problem is that this offending code is actually buried deep within a COTS library that we're using to do image processing (the Hierarchical Data Format (HDF) library). While we do have access to the source code for this library and could conceivably modify it, it is a large and complex library, and a change of this nature would require us to do a large amount of regression testing to ensure that nothing was broken.

So at the end of the day this is really not a "Lustre problem" per se, though we would still be interested in any suggestions as to how we can minimize the effects of this glibc "flush penalty". This penalty is not particularly onerous when reading and writing to local disk, but is obviously more of an issue with a distributed filesystem.

Thank you again for the support.

Rocky

Brian J. Murrell wrote:
> As Andreas suspected, your application is doing 2MB reads every time.
> Does it really need 2MB of data on each read? If not, can you fix
> your application to only read as much data as it actually wants?
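A minimal sketch of the workaround described above, with illustrative names (the data file and offsets are made up): buffering is disabled with setvbuf() immediately after fopen(), so the seek-heavy loop no longer pays the stdio flush/refill cost on every fseek():

    #include <stdio.h>

    int main(void)
    {
        long offsets[] = { 0, 2097152, 4096, 1048576 };   /* illustrative */
        FILE *fp = fopen("datafile", "rb");               /* illustrative */
        if (fp == NULL)
            return 1;

        /* Must be called before any other operation on the stream:
         * switch it to unbuffered mode. */
        setvbuf(fp, NULL, _IONBF, 0);

        for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
            fseek(fp, offsets[i], SEEK_SET);   /* now essentially free */

        fclose(fp);
        return 0;
    }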
On 2010-04-14, at 11:08, Ronald K Long wrote:
> The issue appears to be the behavior of the glibc implementation of
> fseek(). Apparently, a call to fseek() on a buffered file stream
> causes glibc to flush the stream, regardless of whether a flush is
> actually needed. If we modify the snippet to call setvbuf() and
> disable buffering on the file stream before any of the fseek() calls,
> then it finishes more or less instantly, as you would expect.

I'd encourage you to file a bug (preferably with a patch) against glibc to fix this. I've had reasonable success in getting problems like this fixed upstream.

> The problem is that this offending code is actually buried deep within
> a COTS library that we're using to do image processing (the
> Hierarchical Data Format (HDF) library). [...]
>
> So at the end of the day this is really not a "Lustre problem" per se,
> though we would still be interested in any suggestions as to how we
> can minimize the effects of this glibc "flush penalty".

Similarly, HDF + Lustre usage is very common, and I expect that the HDF developers would be interested to fix this if possible.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
After doing some more digging, it looks as though a bug was reported on this in 2007:

https://bugzilla.lustre.org/show_bug.cgi?id=12739

We have loaded the Lustre patch attached to this bug; however, when running the set_param command I am getting the following error:

    lctl set_param llite*.*.stat_blksize=4096
    error: set_param: /proc/{fs,sys}/{lnet,lustre}/llite/lustre*/stat_blksize: No such process

Is this patch still valid for 2.6.9-78.0.22.EL_lustre.1.6.7.2smp?

Thanks again

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> I'd encourage you to file a bug (preferably with a patch) against
> glibc to fix this. I've had reasonable success in getting problems
> like this fixed upstream.
>
> Similarly, HDF + Lustre usage is very common, and I expect that the
> HDF developers would be interested to fix this if possible.
We were able to find where to tune stat_blksize after loading the patch mentioned below, and the fseek function is working correctly with this patch installed.

Thanks

Rocky

> After doing some more digging, it looks as though a bug was reported
> on this in 2007:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=12739
>
> We have loaded the Lustre patch attached to this bug; however, when
> running the set_param command I am getting the following error:
>
>     lctl set_param llite*.*.stat_blksize=4096
>     error: set_param: /proc/{fs,sys}/{lnet,lustre}/llite/lustre*/stat_blksize: No such process