Hi all,

our Lustre FS shows an interesting performance problem which I'd like to discuss, as some of you might have seen this kind of thing before and maybe someone has a quick explanation of what's going on.

We are running Lustre 1.6.5.1. The problem shows up when we read a shared file from multiple nodes that has just been written from the same set of nodes. 512 processes write a checkpoint (1.5 GB from each node) into a shared file by seeking to position RANK*1.5GB and writing 1.5 GB in 1.44 MB chunks. Writing works fine and gives the full file system performance. The data is written using write() with no flags other than O_CREAT and O_WRONLY. Once the checkpoint is written, the program is terminated, restarted, and reads back the same portion of the file. For some reason this almost immediate read of the same data that was just written on the same node is very slow. If we a) change the set of nodes or b) wait a day, we get the full read performance with the same executable and the same shared file.

Is there a reason why an immediate read after a write on the same node from/to a shared file is slow? Is there any additional communication, e.g. is the client flushing the buffer cache before the first read? The statistics show that the average time to complete a 1.44 MB read request increases during the runtime of our program. At some point it hits an upper limit or a saturation point and stays there. Is there some kind of queue or something that is getting full in this write/read scenario? Maybe some of the tunables in /proc/fs/lustre would help?

Regards, Michael

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
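P.S.: For reference, the write pattern is roughly the following (a minimal sketch only; the file name, the exact byte values behind "1.44 MB" and "1.5 GB", and the omitted error handling are assumptions, but as described above the real code uses nothing besides open() with O_CREAT|O_WRONLY, lseek() and write()):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define CHUNK_SIZE (1440UL * 1024UL)           /* 1.44 MB write chunk (assumed 1440 KiB) */
    #define CKPT_SIZE  (1536UL * 1024UL * 1024UL)  /* 1.5 GB per process (assumed 1536 MiB)  */

    /* Each of the 512 ranks writes its 1.5 GB slice at offset RANK * 1.5 GB. */
    static void write_checkpoint(const char *path, int rank, const char *buf)
    {
        int    fd     = open(path, O_CREAT | O_WRONLY, 0644);
        off_t  offset = (off_t)rank * (off_t)CKPT_SIZE;
        size_t done   = 0;

        lseek(fd, offset, SEEK_SET);
        while (done < CKPT_SIZE) {
            size_t n = CKPT_SIZE - done;
            if (n > CHUNK_SIZE)
                n = CHUNK_SIZE;
            done += (size_t)write(fd, buf + done, n);  /* return codes not checked in this sketch */
        }
        close(fd);
    }

After the restart the program opens the same file with O_RDONLY, seeks to the same offset and read()s the same 1.44 MB chunks back.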
On Oct 06, 2009 13:24 +0200, Michael Kluge wrote:
> We are running Lustre 1.6.5.1. The problem shows up when we read a
> shared file from multiple nodes that has just been written from the
> same set of nodes. 512 processes write a checkpoint (1.5 GB from each
> node) into a shared file by seeking to position RANK*1.5GB and writing
> 1.5 GB in 1.44 MB chunks. Writing works fine and gives the full file
> system performance. The data is written using write() with no flags
> other than O_CREAT and O_WRONLY. Once the checkpoint is written, the
> program is terminated, restarted, and reads back the same portion of
> the file. For some reason this almost immediate read of the same data
> that was just written on the same node is very slow. If we a) change
> the set of nodes or b) wait a day, we get the full read performance
> with the same executable and the same shared file.
>
> Is there a reason why an immediate read after a write on the same node
> from/to a shared file is slow? Is there any additional communication,
> e.g. is the client flushing the buffer cache before the first read?
> The statistics show that the average time to complete a 1.44 MB read
> request increases during the runtime of our program. At some point it
> hits an upper limit or a saturation point and stays there. Is there
> some kind of queue or something that is getting full in this
> write/read scenario? Maybe some of the tunables in /proc/fs/lustre
> would help?

One possible issue is that you don't have enough extra RAM to cache 1.5 GB of the checkpoint, so during the write it is being flushed to the OSTs and evicted from cache. When you immediately restart, there is still dirty data being written from the clients that contends with the reads for the restart.

As a general rule, avoiding unnecessary IO (i.e. reading back data that was just written) reduces the time that the application is not doing useful work (i.e. computing).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Tuesday, 06.10.2009, 09:33 -0600, Andreas Dilger wrote:
> > ... bla bla ...
> > Is there a reason why an immediate read after a write on the same node
> > from/to a shared file is slow? Is there any additional communication,
> > e.g. is the client flushing the buffer cache before the first read?
> > The statistics show that the average time to complete a 1.44 MB read
> > request increases during the runtime of our program. At some point it
> > hits an upper limit or a saturation point and stays there. Is there
> > some kind of queue or something that is getting full in this
> > write/read scenario? Maybe some of the tunables in /proc/fs/lustre
> > would help?
>
> One possible issue is that you don't have enough extra RAM to cache
> 1.5 GB of the checkpoint, so during the write it is being flushed to
> the OSTs and evicted from cache. When you immediately restart, there
> is still dirty data being written from the clients that contends with
> the reads for the restart.
>
> Cheers, Andreas

Well, I do call fsync() after the write is finished. During the write process I see a constant stream of 4 GB/s running from the Lustre servers to the RAID controllers, which stops when the write process terminates. When I start reading, there are no more writes going this way, so I suspect it might be something else ... Even if I wait 5 minutes between the writes and the reads (all dirty pages should have been flushed by then), the picture does not change.

Michael

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
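P.S.: One way to double-check that no dirty page-cache data is left on a client before the read phase is to look at the Dirty and Writeback counters in /proc/meminfo right before the reads start (a minimal sketch; this uses only the standard Linux /proc/meminfo interface and is not Lustre-specific):

    #include <stdio.h>
    #include <string.h>

    /* Print the Dirty and Writeback lines from /proc/meminfo; both should be
     * (close to) zero once all checkpoint data has been flushed to the OSTs. */
    int main(void)
    {
        char  line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "Dirty:", 6) || !strncmp(line, "Writeback:", 10))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }

If both stay near zero after the fsync() and the read performance is still poor, leftover dirty data on the clients is unlikely to be the cause.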