Hi all,

our Lustre FS shows an interesting performance problem which I'd like to discuss, as some of you might have seen this kind of thing before and maybe someone has a quick explanation of what's going on.

We are running Lustre 1.6.5.1. The problem shows up when we read a shared file from multiple nodes that has just been written from the same set of nodes. 512 processes write a checkpoint (1.5 GB from each node) into a shared file by seeking to position RANK*1.5GB and writing 1.5 GB in 1.44 MB chunks. Writing works fine and gives the full file system performance. The data is written using write() with no flags other than O_CREAT and O_WRONLY. Once the checkpoint is written, the program is terminated, restarted, and reads back the same portion of the file. For some reason this almost immediate read of the same data that was just written on the same node is very slow. If we a) change the set of nodes or b) wait a day, we get the full read performance with the same executable and the same shared file.

Is there a reason why an immediate read after a write on the same node from/to a shared file is slow? Is there any additional communication, e.g. is the client flushing the buffer cache before the first read? The statistics show that the average time to complete a 1.44 MB read request increases during the runtime of our program. At some point it hits an upper limit or a saturation point and stays there. Is there some kind of queue or something that is getting full in this write/read scenario? Maybe some of the tunables in /proc/fs/lustre would help?

Regards, Michael

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
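P.S.: For reference, the write pattern is roughly the following (a minimal sketch only; the file name, the exact byte values behind "1.44 MB" and "1.5 GB", and the omitted error handling are assumptions, but as described above the real code uses nothing besides open() with O_CREAT|O_WRONLY, lseek() and write()):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define CHUNK_SIZE (1440UL * 1024UL)           /* 1.44 MB write chunk (assumed 1440 KiB) */
    #define CKPT_SIZE  (1536UL * 1024UL * 1024UL)  /* 1.5 GB per process (assumed 1536 MiB)  */

    /* Each of the 512 ranks writes its 1.5 GB slice at offset RANK * 1.5 GB. */
    static void write_checkpoint(const char *path, int rank, const char *buf)
    {
        int    fd     = open(path, O_CREAT | O_WRONLY, 0644);
        off_t  offset = (off_t)rank * (off_t)CKPT_SIZE;
        size_t done   = 0;

        lseek(fd, offset, SEEK_SET);
        while (done < CKPT_SIZE) {
            size_t n = CKPT_SIZE - done;
            if (n > CHUNK_SIZE)
                n = CHUNK_SIZE;
            done += (size_t)write(fd, buf + done, n);  /* return codes not checked in this sketch */
        }
        close(fd);
    }

After the restart the program opens the same file with O_RDONLY, seeks to the same offset and read()s the same 1.44 MB chunks back.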
On Oct 06, 2009 13:24 +0200, Michael Kluge wrote:
> We are running Lustre 1.6.5.1. The problem shows up when we read a
> shared file from multiple nodes that has just been written from the
> same set of nodes. 512 processes write a checkpoint (1.5 GB from each
> node) into a shared file by seeking to position RANK*1.5GB and writing
> 1.5 GB in 1.44 MB chunks. Writing works fine and gives the full file
> system performance. The data is written using write() with no flags
> other than O_CREAT and O_WRONLY. Once the checkpoint is written, the
> program is terminated, restarted, and reads back the same portion of
> the file. For some reason this almost immediate read of the same data
> that was just written on the same node is very slow. If we a) change
> the set of nodes or b) wait a day, we get the full read performance
> with the same executable and the same shared file.
>
> Is there a reason why an immediate read after a write on the same node
> from/to a shared file is slow? Is there any additional communication,
> e.g. is the client flushing the buffer cache before the first read?
> The statistics show that the average time to complete a 1.44 MB read
> request increases during the runtime of our program. At some point it
> hits an upper limit or a saturation point and stays there. Is there
> some kind of queue or something that is getting full in this
> write/read scenario? Maybe some of the tunables in /proc/fs/lustre
> would help?

One possible issue is that you don't have enough extra RAM to cache 1.5 GB of the checkpoint, so during the write it is being flushed to the OSTs and evicted from cache. When you immediately restart, there is still dirty data being written from the clients that contends with the reads for the restart.

As a general rule, avoiding unnecessary IO (i.e. reading back data that was just written) reduces the time that the application is not doing useful work (i.e. computing).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Tuesday, 06.10.2009, 09:33 -0600, Andreas Dilger wrote:
> > ... bla bla ...
> > Is there a reason why an immediate read after a write on the same node
> > from/to a shared file is slow? Is there any additional communication,
> > e.g. is the client flushing the buffer cache before the first read?
> > The statistics show that the average time to complete a 1.44 MB read
> > request increases during the runtime of our program. At some point it
> > hits an upper limit or a saturation point and stays there. Is there
> > some kind of queue or something that is getting full in this
> > write/read scenario? Maybe some of the tunables in /proc/fs/lustre
> > would help?
>
> One possible issue is that you don't have enough extra RAM to cache
> 1.5 GB of the checkpoint, so during the write it is being flushed to
> the OSTs and evicted from cache. When you immediately restart, there
> is still dirty data being written from the clients that contends with
> the reads for the restart.
>
> Cheers, Andreas

Well, I do call fsync() after the write is finished. During the write process I see a constant stream of 4 GB/s running from the Lustre servers to the RAID controllers, which stops when the write process terminates. When I start reading, there are no more writes going this way, so I suspect it might be something else ... Even if I wait 5 minutes between the writes and the reads (all dirty pages should have been flushed by then), the picture does not change.

Michael

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
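P.S.: One way to double-check that no dirty page-cache data is left on a client before the read phase is to look at the Dirty and Writeback counters in /proc/meminfo right before the reads start (a minimal sketch; this uses only the standard Linux /proc/meminfo interface and is not Lustre-specific):

    #include <stdio.h>
    #include <string.h>

    /* Print the Dirty and Writeback lines from /proc/meminfo; both should be
     * (close to) zero once all checkpoint data has been flushed to the OSTs. */
    int main(void)
    {
        char  line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "Dirty:", 6) || !strncmp(line, "Writeback:", 10))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }

If both stay near zero after the fsync() and the read performance is still poor, leftover dirty data on the clients is unlikely to be the cause.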