Hello,

We've been seeing strange performance behavior with random I/O on large
files on a Lustre FS. Trying to decipher it, I took a closer look at the
'i_blksize' value used by Lustre, particularly in the ll_getattr_it()
call.

The behavior is: some of our tools do a huge number of fseek-fread(a few
bytes) operations on large files. Those tools show very poor performance
compared to what could be expected given Lustre's usual performance on
the same configuration:

    f = fopen()
    pos = somewhere
    do
        fseek(f, pos)
        fread(10 bytes, f)
        pos = somewhere_else
    loop

Analyzing the behavior, it appears that, for each small fread(), glibc
actually performed a 2MB read, leading to poor performance as seen from
the program (even though the global Lustre throughput was good, only a
few bytes of each 2MB read were useful).

So why does glibc read so much data when the binary requested only a few
bytes? It turns out that the library uses the 'st_blksize' value returned
by fstat() as the optimal read-ahead block size, and sizes its stdio
buffer accordingly. This value is computed by Lustre using:

    [lustre/llite/llite_lib.c:1567]
    inode->i_blkbits = min(PTLRPC_MAX_BRW_BITS+1, LL_MAX_BLKSIZE_BITS);

which is quite a huge value (2MB), and, as our tool does many fseek()
calls, this buffer is useless and is discarded between each read.

We made a small patch so that the value returned by ll_getattr_it()
(which fills the stat struct) can be changed via a module parameter.
Performance is really good when the value is reduced to 4KB. Indeed, in
these cases Lustre manages and buffers the I/O itself much better than
glibc does.

So we are wondering whether it would be a good idea to reduce the default
value, or to use something like a module parameter to tune it depending
on the kind of I/O done on the client. What's your opinion on this?

--
Aurelien Degremont
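For reference, the access pattern above boils down to something like the
following self-contained C sketch; the path, file size, and iteration
count are placeholders, not values from the original tools:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char buf[10];
        long pos;
        int i;

        /* Placeholder path: any large file on the Lustre mount. */
        FILE *f = fopen("/mnt/lustre/bigfile", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }

        /* Placeholder size: assume a 1GB file for illustration. */
        const long file_size = 1L * 1024 * 1024 * 1024;

        for (i = 0; i < 100000; i++) {
            pos = (long)(drand48() * (double)(file_size - (long)sizeof(buf)));
            /* fseek() discards the stdio read buffer; glibc refills it on
             * the next fread() with a read of up to st_blksize bytes (2MB
             * here), of which only 10 bytes are actually used. */
            fseek(f, pos, SEEK_SET);
            if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
                break;
        }

        fclose(f);
        return 0;
    }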
Solofo.Ramangalahy@bull.net
2007-Feb-13 05:46 UTC
[Lustre-devel] i_blksize value in lustre 1.5.97
Aurelien Degremont writes:
> Analyzing the behavior, it appears that, for each small fread(), glibc
> actually performed a 2MB read, leading to poor performance as seen from
> the program (even though the global Lustre throughput was good, only a
> few bytes of each 2MB read were useful).
>
> So why does glibc read so much data when the binary requested only a
> few bytes?

Not sure it is relevant, but there was also some discussion about an
fseek glibc optimisation last year:

http://lkml.org/lkml/2006/2/28/187
"Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"

--
solofo
Solofo.Ramangalahy@bull.net wrote:
> Not sure it is relevant, but there was also some discussion about an
> fseek glibc optimisation last year:
> http://lkml.org/lkml/2006/2/28/187
> "Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"

This is perfectly relevant, because it seems Reiser hit exactly the same
issue with ReiserFS (the value jumped from 4K in 2.4 to 128K in 2.6).

The solution I propose for Lustre: since Lustre can buffer the I/O far
better than glibc, why not return a small value for st_blksize and let
Lustre manage the different I/O patterns?

Does any other code use the st_blksize value returned by stat(), apart
from glibc?

--
Aurelien Degremont
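For reference, the field in question is visible to any program through
stat(2); a minimal self-contained check (the path is a placeholder):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        /* Placeholder path: any file on the Lustre mount. */
        if (stat("/mnt/lustre/bigfile", &st) != 0) {
            perror("stat");
            return 1;
        }

        /* glibc sizes its stdio buffers from this field. */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        return 0;
    }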
On Feb 13, 2007 13:22 +0100, Aurelien Degremont wrote:
> Analyzing the behavior, it appears that, for each small fread(), glibc
> actually performed a 2MB read, leading to poor performance as seen
> from the program.

Ah, interesting. I've seen similar reports, but didn't realize the
problem was caused by glibc using the block size, rather than by the
application itself using f_bsize.

> [lustre/llite/llite_lib.c:1567]
> inode->i_blkbits = min(PTLRPC_MAX_BRW_BITS+1, LL_MAX_BLKSIZE_BITS);
>
> Performance is really good when the value is reduced to 4KB. Indeed,
> in these cases Lustre manages and buffers the I/O itself much better
> than glibc does. So we are wondering whether it would be a good idea
> to reduce the default value, or to use something like a module
> parameter to tune it depending on the kind of I/O done on the client.

We haven't tested the performance of such a change under different
loads, so I can't really comment. The original reason that i_blkbits was
increased was that programs like "cp" benefit a great deal from a larger
block size.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
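To make the "cp" point concrete: a streaming copy typically sizes its
buffer from st_blksize, so the large advertised value is exactly what you
want there. A simplified sketch of such a copy loop, not GNU cp's actual
code:

    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    /* Simplified cp-style copy loop: streaming readers size their buffer
     * from st_blksize, so a 2MB value means fewer, bigger, faster reads. */
    static int copy_file(const char *src, const char *dst)
    {
        struct stat st;
        char *buf;
        size_t bufsize;
        ssize_t n;

        int in = open(src, O_RDONLY);
        if (in < 0)
            return -1;
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            close(in);
            return -1;
        }

        /* Buffer sized from the filesystem's advertised preferred I/O
         * size; with Lustre's 2MB st_blksize, each read() moves 2MB. */
        if (fstat(in, &st) != 0 || st.st_blksize <= 0)
            bufsize = 4096;
        else
            bufsize = (size_t)st.st_blksize;

        buf = malloc(bufsize);
        if (!buf) {
            close(in);
            close(out);
            return -1;
        }

        while ((n = read(in, buf, bufsize)) > 0)
            if (write(out, buf, (size_t)n) != n)
                break;

        free(buf);
        close(in);
        close(out);
        return n == 0 ? 0 : -1;
    }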
Hello!

On Tue, Feb 13, 2007 at 04:35:26PM +0100, Aurelien Degremont wrote:
> Solofo.Ramangalahy@bull.net wrote:
> > Not sure it is relevant, but there was also some discussion about an
> > fseek glibc optimisation last year:
> > http://lkml.org/lkml/2006/2/28/187
> > "Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"
> This is perfectly relevant, because it seems Reiser hit exactly the
> same issue with ReiserFS (the value jumped from 4K in 2.4 to 128K in
> 2.6).

In reiserfs the problem was only reported as slow kmail speed, I think.
Also, as I remember, the decision at the time was that i_blksize
actually has nothing to do with reads. It is just a hint for WRITEs,
showing the optimal write size to avoid partial-block writes (and the
associated speed penalties).

Too bad glibc has now adopted this bad practice. Perhaps somebody can
file a bug with the glibc people?

Bye,
Oleg
Oleg Drokin wrote:
> Too bad glibc has now adopted this bad practice. Perhaps somebody can
> file a bug with the glibc people?

Sure, it would be a good idea. Could somebody at CFS do this? As far as
I know, Reiser tried to sort this out with the glibc people and they
refused. But Hans Reiser is not known for being temperate...

And what about a workaround in the meantime? What is the short-term
solution?

--
Aurelien Degremont
Hello!

On Wed, Feb 21, 2007 at 11:00:26AM +0100, Aurelien Degremont wrote:
> Oleg Drokin wrote:
> > Too bad glibc has now adopted this bad practice. Perhaps somebody
> > can file a bug with the glibc people?
> Sure, it would be a good idea. Could somebody at CFS do this?

http://sources.redhat.com/bugzilla/show_bug.cgi?id=4099

> As far as I know, Reiser tried to sort this out with the glibc people
> and they refused. But Hans Reiser is not known for being temperate...

I do not remember there actually being any reply.

> And what about a workaround in the meantime? What is the short-term
> solution?

Do not use streaming I/O functions for now? It is also possible to
decrease the st_blksize reported by Lustre by patching Lustre.

Bye,
Oleg
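Two untested sketches of the first workaround, i.e. avoiding (or taming)
the stdio layer; the path and offsets are placeholders:

    #define _XOPEN_SOURCE 500   /* for pread() */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[10];

        /* Workaround 1: bypass stdio entirely. pread() reads exactly the
         * bytes asked for; there is no read-ahead buffer to throw away. */
        int fd = open("/mnt/lustre/bigfile", O_RDONLY);  /* placeholder */
        if (fd >= 0) {
            pread(fd, buf, sizeof(buf), 123456789);      /* placeholder */
            close(fd);
        }

        /* Workaround 2: keep stdio, but cap its buffer at 4KB instead of
         * letting glibc size it from the 2MB st_blksize. setvbuf() must
         * be called before the first operation on the stream. */
        FILE *f = fopen("/mnt/lustre/bigfile", "rb");
        if (f) {
            static char iobuf[4096];
            setvbuf(f, iobuf, _IOFBF, sizeof(iobuf));
            fseek(f, 123456789L, SEEK_SET);
            fread(buf, 1, sizeof(buf), f);
            fclose(f);
        }
        return 0;
    }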
On Feb 25, 2007 21:11 -0500, Oleg Drokin wrote:
> > And what about a workaround in the meantime? What is the short-term
> > solution?
>
> Do not use streaming I/O functions for now? It is also possible to
> decrease the st_blksize reported by Lustre by patching Lustre.

With older versions of Lustre, if you specified a striping like 65536
bytes, then st_blksize would be returned as stripe_count * stripe_size.
This caused problems with NFS export and connectathon, because NFS
creates files via mknod() and they don't have any file striping
associated with them until the file is opened, so the st_blksize of a
file could change. We might want to consider revisiting this change.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Oleg Drokin wrote:
> > And what about a workaround in the meantime? What is the short-term
> > solution?
>
> It is also possible to decrease the st_blksize reported by Lustre by
> patching Lustre.

Here is a small patch that adds a module parameter tuning the value
returned by the stat/fstat/lstat syscalls (which all go through
ll_getattr_it()).

By default, Lustre behaves exactly as it does today; if you change the
module parameter 'stat_blksize', Lustre will return that value as the
preferred I/O size. The value can also be changed dynamically through
/proc. We have been using it for days with success, and it makes tuning
easy, depending on your preferred behaviour.

I can add it to Bugzilla if needed. The patch is against Lustre 1.5.97.

--
Aurelien Degremont
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stat_blksize.patch
Type: text/x-patch
Size: 2108 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20070302/2a41d071/stat_blksize.bin
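The attachment itself was scrubbed by the archive. Based on the
description, the change presumably looks roughly like the sketch below
against the 1.5.97-era llite code. The parameter name 'stat_blksize'
comes from the mail; the function signature, hook placement, and elided
code are assumptions, not the actual patch:

    /* lustre/llite/llite_lib.c -- sketch, NOT the scrubbed patch itself */

    /* 0 = keep the default behaviour (blksize derived from i_blkbits);
     * any other value is returned verbatim as st_blksize. Runtime-tunable
     * (the mail mentions changing it through /proc). */
    static unsigned int stat_blksize = 0;
    module_param(stat_blksize, uint, 0644);
    MODULE_PARM_DESC(stat_blksize,
                     "preferred I/O size reported by stat(2), 0 = default");

    int ll_getattr_it(struct vfsmount *mnt, struct dentry *de,
                      struct lookup_intent *it, struct kstat *stat)
    {
            struct inode *inode = de->d_inode;

            /* ... existing attribute revalidation and stat filling
             *     elided ... */

            if (stat_blksize)
                    stat->blksize = stat_blksize;           /* override */
            else
                    stat->blksize = 1 << inode->i_blkbits;  /* 2MB today */

            return 0;
    }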