Hi,

I am interested in how write()s are buffered in Lustre on the client, server, and the network in between. Specifically, I'd like to understand what happens during writes when a large number of clients are making large writes to all of the OSTs on an OSS, and the buffers are inadequate to handle the outgoing/incoming data. I know nothing about Lustre's buffering; can anyone point me to a source of information?

Thanks,
Burlen
On 2010-08-11, at 23:36, burlen wrote:
> I am interested in how write()s are buffered in Lustre on the client,
> server, and network in between. Specifically, I'd like to understand what
> happens during writes when a large number of clients are making large
> writes to all of the OSTs on an OSS, and the buffers are inadequate to
> handle the outgoing/incoming data.

Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
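[As a rough illustration of the grant mechanism Andreas describes, here is a toy model. All class names, method names, and sizes are invented for the sketch; this is not Lustre code.]

```python
# Toy model of Lustre-style space "grant": the server reserves space per
# client so cached (asynchronous) writes can never run the OST out of room.
# All names and numbers are illustrative, not Lustre internals.

class OST:
    def __init__(self, free_bytes):
        self.free = free_bytes
        self.granted = 0          # total space promised to clients

    def grant(self, client, amount):
        """Reserve space for a client's future cached writes."""
        avail = self.free - self.granted
        give = min(amount, avail)
        self.granted += give
        client.grant += give
        return give

class Client:
    def __init__(self):
        self.grant = 0            # server-granted reserve
        self.cached = 0           # dirty bytes held in the client page cache

    def cached_write(self, nbytes):
        """Buffer a write only if the grant covers it; otherwise the client
        must fall back to synchronous (write-through) I/O."""
        if nbytes <= self.grant:
            self.grant -= nbytes
            self.cached += nbytes
            return "async"
        return "sync"

ost = OST(free_bytes=100 << 20)   # 100MB free on the OST
c = Client()
ost.grant(c, 32 << 20)            # server grants 32MB to this client
print(c.cached_write(16 << 20))   # fits within the grant -> "async"
print(c.cached_write(32 << 20))   # exceeds remaining grant -> "sync"
```

The point of the sketch: because every cached page is backed by granted space, the OST can never be asked to store dirty data it has no room for.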
Andreas Dilger wrote:
> On 2010-08-11, at 23:36, burlen wrote:
>> I am interested in how write()s are buffered in Lustre on the client,
>> server, and network in between. Specifically, I'd like to understand what
>> happens during writes when a large number of clients are making large
>> writes to all of the OSTs on an OSS, and the buffers are inadequate to
>> handle the outgoing/incoming data.
>
> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.

Thanks for your answer. If I understand the way write() typically works on Linux: during a large write(), too large to be buffered in the page cache, once the page cache is full the dirty pages are flushed to disk. The data transfer blocks at that point until the dirty pages are written to disk, and then resumes into the resulting free pages. But in Lustre I assume that once the client's page cache is full, the dirty pages are instead sent over the network to the OSS, where they are written to disk. In that case, does the network layer effectively act like a buffer, so that the client may resume the data transfer into the page cache before the former set of dirty pages actually hits the disk? Or does the data transfer block until the dirty pages actually reach the disk?

Thanks,
Burlen
On 2010-08-12, at 14:52, burlen wrote:
> Andreas Dilger wrote:
>> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.
>
> If I understand the way write() typically works on Linux: during a large write(), too large to be buffered in the page cache, once the page cache is full the dirty pages are flushed to disk. The data transfer blocks at that point until the dirty pages are written to disk, and then resumes into the resulting free pages. But in Lustre I assume that once the client's page cache is full, the dirty pages are sent over the network to the OSS, where they are written to disk.

In fact, Lustre aggressively flushes dirty data from the client as soon as it can create a 1MB RPC. Otherwise the VM would cache dirty data for up to 30s, and if you work out that much idle cache across all clients against the aggregate network bandwidth, it would be a huge waste of bandwidth to leave the data sitting unsent.

> In that case, does the network layer effectively act like a buffer? So that the client may resume the data transfer into the page cache before the former set of dirty pages actually hits the disk? Or does the data transfer block until the dirty pages actually reach the disk?

Lustre also limits the dirty page cache per OST far below the VM limits, for similar reasons as above. Clients can have 32MB (default) of dirty data per OST, and up to 8 RPCs (default) in flight per OST at one time.
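[A minimal sketch of the client-side limits described above: aggregation into 1MB RPCs, the 32MB-per-OST dirty cap, and the 8-RPC in-flight window. The class and method names are invented for illustration and do not correspond to Lustre code paths.]

```python
# Toy model of per-OST client write limits: dirty pages are aggregated
# into 1MB RPCs as soon as possible, at most MAX_DIRTY un-flushed bytes
# are cached, and at most MAX_RPCS_IN_FLIGHT RPCs are outstanding.

RPC_SIZE = 1 << 20            # 1MB write RPCs
MAX_DIRTY = 32 << 20          # default 32MB dirty data per OST
MAX_RPCS_IN_FLIGHT = 8        # default 8 concurrent RPCs per OST

class OscWriteback:
    def __init__(self):
        self.dirty = 0        # bytes cached but not yet sent
        self.in_flight = []   # RPCs sent, awaiting server reply

    def write(self, nbytes):
        """Buffer application data; returns False if the caller would
        block (dirty limit reached and no RPC slot free)."""
        if self.dirty + nbytes > MAX_DIRTY and not self.try_flush():
            return False                   # the application's write() stalls
        self.dirty += nbytes
        self.try_flush()
        return True

    def try_flush(self):
        """Send as many full 1MB RPCs as the in-flight window allows."""
        sent = False
        while self.dirty >= RPC_SIZE and len(self.in_flight) < MAX_RPCS_IN_FLIGHT:
            self.dirty -= RPC_SIZE
            self.in_flight.append(RPC_SIZE)
            sent = True
        return sent

    def reply_received(self):
        """Server acknowledged one RPC, freeing a slot in the window."""
        if self.in_flight:
            self.in_flight.pop(0)
            self.try_flush()

osc = OscWriteback()
ok = all(osc.write(1 << 20) for _ in range(40))   # 40 x 1MB writes
print(ok, len(osc.in_flight), osc.dirty >> 20)    # True 8 32: window and cap full
print(osc.write(1 << 20))                         # False: this write would block
osc.reply_received()                              # one reply frees a slot
print(len(osc.in_flight), osc.dirty >> 20)        # 8 31: next RPC goes out
```

Once the window and the dirty cap are both full, the application's write() can make no progress until a server reply arrives, which is exactly the back-pressure the thread is asking about.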
The network does NOT act as a buffer, since the client must keep a copy of all (meta)data in memory until it is ACK'd by the server (it is not fire-and-forget), so that the client can replay the RPC in case of a server crash. The server sends an ACK (RPC reply) when it has processed the RPC, along with a transaction number for that RPC, and asynchronously notifies the clients that RPCs with transno <= last_committed_transno have been committed to disk, at which point they can discard their copies of those RPCs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
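[The retain-until-committed protocol above can be sketched as follows; the names here (ReplayQueue, reply, replay) are made up for the example, and the real protocol piggybacks last_committed_transno on ordinary RPC replies.]

```python
# Toy model of the replay protocol: the client retains a copy of each RPC
# until the server reports, via last_committed_transno carried on later
# replies, that the RPC has reached stable storage. Illustrative only.

from collections import OrderedDict

class ReplayQueue:
    def __init__(self):
        self.pending = OrderedDict()   # transno -> saved RPC copy

    def reply(self, transno, rpc, last_committed):
        """Server processed `rpc`, assigning `transno`; the reply also
        carries the highest transno already committed to disk."""
        self.pending[transno] = rpc
        self.discard_committed(last_committed)

    def discard_committed(self, last_committed):
        """Drop saved copies the server no longer needs us to replay."""
        for t in [t for t in self.pending if t <= last_committed]:
            del self.pending[t]

    def replay(self):
        """On server restart, resend everything not known to be committed."""
        return list(self.pending.values())

q = ReplayQueue()
q.reply(101, "write A", last_committed=100)   # A processed, not yet on disk
q.reply(102, "write B", last_committed=100)   # B likewise; A still retained
q.reply(103, "write C", last_committed=102)   # commit notice covers A and B
print(q.replay())                             # -> ['write C']
```

This is why the client's memory, not the network, is the buffer: every byte in flight is pinned on the client until the commit notification arrives.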
On 2010-08-12, at 15:08, Mark Nelson wrote:
> How does the kernel and storage on the OSSes aggregate writes when the number of service threads is increased?

The OSS layer does not aggregate writes itself. This is done on the client before the write RPCs are generated, or in the block device (elevator and/or cache for h/w RAID devices) at the bottom end. There is a research project called "Network Request Scheduler" that aims to submit the I/Os in a more coherent order at the OSS thread level, to facilitate block-device merging, but it will not explicitly merge the I/Os itself.

> The Lustre tuning section on the wiki mentions that there are "internal I/O buffers". How is aggregating those writes different from the way the dirty cache on the clients works?
>
> http://wiki.lustre.org/index.php/Lustre_Tuning

In 1.6 and earlier there was an explicit 1MB pre-allocated receive buffer for every thread, used to stage a single I/O RPC from network RDMA and submit it to the block layer. In 1.8 and later this 1MB of memory is dynamically allocated from the page cache, at least for the duration of the I/O submission, and then, depending on /proc tunables (read_cache_enable, writethrough_cache_enable, readcache_max_filesize), it will either discard the page immediately, or keep it in memory and let the VM evict it when there is memory pressure (if not accessed).

> On 08/12/2010 12:35 PM, Andreas Dilger wrote:
>> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.
>
> --
> Mark Nelson, Lead Software Developer
> Minnesota Supercomputing Institute
> Phone: (612)626-4479
> Email: mark at msi.umn.edu

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
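[A sketch of the 1.8+ server-side cache decision described above. The tunable names come from the message; the decision logic here is a simplification I am assuming for illustration, not the actual obdfilter code.]

```python
# Simplified model of the OSS page-cache decision in Lustre 1.8+:
# after an I/O RPC is staged through the page cache and submitted to
# the block layer, /proc tunables decide whether the pages stay cached
# (left for the VM to evict under memory pressure) or are discarded.
# Tunable names are real; the exact decision logic is an assumption.

def keep_pages_cached(io_kind, file_size,
                      read_cache_enable=1,
                      writethrough_cache_enable=1,
                      readcache_max_filesize=None):
    """Return True if the staged pages are kept in the OSS page cache,
    False if they are discarded as soon as the I/O is submitted."""
    if readcache_max_filesize is not None and file_size > readcache_max_filesize:
        return False                      # large files bypass the cache
    if io_kind == "read":
        return bool(read_cache_enable)
    if io_kind == "write":
        return bool(writethrough_cache_enable)
    return False

print(keep_pages_cached("write", 4 << 20))                                  # True
print(keep_pages_cached("write", 2 << 30, readcache_max_filesize=1 << 30))  # False
print(keep_pages_cached("read", 4 << 20, read_cache_enable=0))              # False
```

Either way, the OSS memory is only a staging area for the duration of the I/O submission; it is never a buffer of un-acknowledged dirty data the way the client cache is.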