Hi,

I am interested in how write()s are buffered in Lustre on the client, server, and the network in between. Specifically, I'd like to understand what happens during writes when a large number of clients are making large writes to all of the OSTs on an OSS, and the buffers are inadequate to handle the outgoing/incoming data. I know nothing about Lustre's buffering; can anyone point me to a source of information?

Thanks,
Burlen
On 2010-08-11, at 23:36, burlen wrote:
> I am interested in how write()s are buffered in Lustre on the client,
> server, and network in between. Specifically, I'd like to understand what
> happens during writes when a large number of clients are making large
> writes to all of the OSTs on an OSS, and the buffers are inadequate to
> handle the outgoing/incoming data.

Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
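[As a rough illustration of the grant mechanism Andreas describes, here is a toy model. All class names, method names, and sizes are invented for the sketch; this is not Lustre code.]

```python
# Toy model of Lustre-style space "grant": the server reserves space per
# client so cached (asynchronous) writes can never run the OST out of room.
# All names and numbers are illustrative, not Lustre internals.

class OST:
    def __init__(self, free_bytes):
        self.free = free_bytes
        self.granted = 0          # total space promised to clients

    def grant(self, client, amount):
        """Reserve space for a client's future cached writes."""
        avail = self.free - self.granted
        give = min(amount, avail)
        self.granted += give
        client.grant += give
        return give

class Client:
    def __init__(self):
        self.grant = 0            # server-granted reserve
        self.cached = 0           # dirty bytes held in the client page cache

    def cached_write(self, nbytes):
        """Buffer a write only if the grant covers it; otherwise the client
        must fall back to synchronous (write-through) I/O."""
        if nbytes <= self.grant:
            self.grant -= nbytes
            self.cached += nbytes
            return "async"
        return "sync"

ost = OST(free_bytes=100 << 20)   # 100MB free on the OST
c = Client()
ost.grant(c, 32 << 20)            # server grants 32MB to this client
print(c.cached_write(16 << 20))   # fits within the grant -> "async"
print(c.cached_write(32 << 20))   # exceeds remaining grant -> "sync"
```

The point of the sketch: because every cached page is backed by granted space, the OST can never be asked to store dirty data it has no room for.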
Andreas Dilger wrote:
> On 2010-08-11, at 23:36, burlen wrote:
>> I am interested in how write()s are buffered in Lustre on the client,
>> server, and network in between. Specifically, I'd like to understand what
>> happens during writes when a large number of clients are making large
>> writes to all of the OSTs on an OSS, and the buffers are inadequate to
>> handle the outgoing/incoming data.
>
> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.

Thanks for your answer. If I understand the way write() typically works on Linux: during a large write(), too large to be buffered in the page cache, once the page cache is full the dirty pages are flushed to disk. The data transfer blocks at that point until the dirty pages are written to disk, and then resumes into the resulting free pages. But in Lustre I assume that once the client's page cache is full, the dirty pages are instead sent over the network to the OSS, where they are written to disk. In that case, does the network layer effectively act like a buffer, so that the client may resume the data transfer into the page cache before the former set of dirty pages actually hits the disk? Or does the data transfer block until the dirty pages actually reach the disk?

Thanks,
Burlen
On 2010-08-12, at 14:52, burlen wrote:
> Andreas Dilger wrote:
>> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.
>
> If I understand the way write() typically works on Linux: during a large write(), too large to be buffered in the page cache, once the page cache is full the dirty pages are flushed to disk. The data transfer blocks at that point until the dirty pages are written to disk, and then resumes into the resulting free pages. But in Lustre I assume that once the client's page cache is full, the dirty pages are sent over the network to the OSS, where they are written to disk.

In fact, Lustre aggressively flushes dirty data from the client as soon as it can create a 1MB RPC. Otherwise the VM would cache dirty data for up to 30s, and if you work out that much idle cache across all clients against the aggregate network bandwidth, it would be a huge waste of bandwidth to leave the data sitting unsent.

> In that case, does the network layer effectively act like a buffer? So that the client may resume the data transfer into the page cache before the former set of dirty pages actually hits the disk? Or does the data transfer block until the dirty pages actually reach the disk?

Lustre also limits the dirty page cache per OST far below the VM limits, for similar reasons as above. Clients can have 32MB (default) of dirty data per OST, and up to 8 RPCs (default) in flight per OST at one time.
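[A minimal sketch of the client-side limits described above: aggregation into 1MB RPCs, the 32MB-per-OST dirty cap, and the 8-RPC in-flight window. The class and method names are invented for illustration and do not correspond to Lustre code paths.]

```python
# Toy model of per-OST client write limits: dirty pages are aggregated
# into 1MB RPCs as soon as possible, at most MAX_DIRTY un-flushed bytes
# are cached, and at most MAX_RPCS_IN_FLIGHT RPCs are outstanding.

RPC_SIZE = 1 << 20            # 1MB write RPCs
MAX_DIRTY = 32 << 20          # default 32MB dirty data per OST
MAX_RPCS_IN_FLIGHT = 8        # default 8 concurrent RPCs per OST

class OscWriteback:
    def __init__(self):
        self.dirty = 0        # bytes cached but not yet sent
        self.in_flight = []   # RPCs sent, awaiting server reply

    def write(self, nbytes):
        """Buffer application data; returns False if the caller would
        block (dirty limit reached and no RPC slot free)."""
        if self.dirty + nbytes > MAX_DIRTY and not self.try_flush():
            return False                   # the application's write() stalls
        self.dirty += nbytes
        self.try_flush()
        return True

    def try_flush(self):
        """Send as many full 1MB RPCs as the in-flight window allows."""
        sent = False
        while self.dirty >= RPC_SIZE and len(self.in_flight) < MAX_RPCS_IN_FLIGHT:
            self.dirty -= RPC_SIZE
            self.in_flight.append(RPC_SIZE)
            sent = True
        return sent

    def reply_received(self):
        """Server acknowledged one RPC, freeing a slot in the window."""
        if self.in_flight:
            self.in_flight.pop(0)
            self.try_flush()

osc = OscWriteback()
ok = all(osc.write(1 << 20) for _ in range(40))   # 40 x 1MB writes
print(ok, len(osc.in_flight), osc.dirty >> 20)    # True 8 32: window and cap full
print(osc.write(1 << 20))                         # False: this write would block
osc.reply_received()                              # one reply frees a slot
print(len(osc.in_flight), osc.dirty >> 20)        # 8 31: next RPC goes out
```

Once the window and the dirty cap are both full, the application's write() can make no progress until a server reply arrives, which is exactly the back-pressure the thread is asking about.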
The network does NOT act as a buffer, since the client must keep a copy of all (meta)data in memory until it is ACK'd by the server (it is not fire-and-forget), so that the client can replay the RPC in case of a server crash. The server sends an ACK (RPC reply) when it has processed the RPC, along with a transaction number for that RPC, and asynchronously notifies the clients that RPCs with transno <= last_committed_transno have been committed to disk, at which point they can discard their copies of those RPCs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
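[The retain-until-committed protocol above can be sketched as follows; the names here (ReplayQueue, reply, replay) are made up for the example, and the real protocol piggybacks last_committed_transno on ordinary RPC replies.]

```python
# Toy model of the replay protocol: the client retains a copy of each RPC
# until the server reports, via last_committed_transno carried on later
# replies, that the RPC has reached stable storage. Illustrative only.

from collections import OrderedDict

class ReplayQueue:
    def __init__(self):
        self.pending = OrderedDict()   # transno -> saved RPC copy

    def reply(self, transno, rpc, last_committed):
        """Server processed `rpc`, assigning `transno`; the reply also
        carries the highest transno already committed to disk."""
        self.pending[transno] = rpc
        self.discard_committed(last_committed)

    def discard_committed(self, last_committed):
        """Drop saved copies the server no longer needs us to replay."""
        for t in [t for t in self.pending if t <= last_committed]:
            del self.pending[t]

    def replay(self):
        """On server restart, resend everything not known to be committed."""
        return list(self.pending.values())

q = ReplayQueue()
q.reply(101, "write A", last_committed=100)   # A processed, not yet on disk
q.reply(102, "write B", last_committed=100)   # B likewise; A still retained
q.reply(103, "write C", last_committed=102)   # commit notice covers A and B
print(q.replay())                             # -> ['write C']
```

This is why the client's memory, not the network, is the buffer: every byte in flight is pinned on the client until the commit notification arrives.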
On 2010-08-12, at 15:08, Mark Nelson wrote:
> How does the kernel and storage on the OSSes aggregate writes when the number of service threads is increased?

The OSS layer does not aggregate writes itself. This is done on the client before the write RPCs are generated, or in the block device (elevator and/or cache for h/w RAID devices) at the bottom end. There is a research project called "Network Request Scheduler" that aims to submit the I/Os in a more coherent order at the OSS thread level, to facilitate block-device merging, but it will not explicitly merge the I/Os itself.

> The Lustre tuning section on the wiki mentions that there are "internal I/O buffers". How is aggregating those writes different from the way the dirty cache on the clients works?
>
> http://wiki.lustre.org/index.php/Lustre_Tuning

In 1.6 and earlier there was an explicit 1MB pre-allocated receive buffer for every thread, used to stage a single I/O RPC from network RDMA and submit it to the block layer. In 1.8 and later this 1MB of memory is dynamically allocated from the page cache, at least for the duration of the I/O submission, and then, depending on /proc tunables (read_cache_enable, writethrough_cache_enable, readcache_max_filesize), it will either discard the page immediately, or keep it in memory and let the VM evict it when there is memory pressure (if not accessed).

> On 08/12/2010 12:35 PM, Andreas Dilger wrote:
>> Lustre doesn't buffer dirty pages on the OSS, only on the client. The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.
>
> --
> Mark Nelson, Lead Software Developer
> Minnesota Supercomputing Institute
> Phone: (612)626-4479
> Email: mark at msi.umn.edu

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
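[A sketch of the 1.8+ server-side cache decision described above. The tunable names come from the message; the decision logic here is a simplification I am assuming for illustration, not the actual obdfilter code.]

```python
# Simplified model of the OSS page-cache decision in Lustre 1.8+:
# after an I/O RPC is staged through the page cache and submitted to
# the block layer, /proc tunables decide whether the pages stay cached
# (left for the VM to evict under memory pressure) or are discarded.
# Tunable names are real; the exact decision logic is an assumption.

def keep_pages_cached(io_kind, file_size,
                      read_cache_enable=1,
                      writethrough_cache_enable=1,
                      readcache_max_filesize=None):
    """Return True if the staged pages are kept in the OSS page cache,
    False if they are discarded as soon as the I/O is submitted."""
    if readcache_max_filesize is not None and file_size > readcache_max_filesize:
        return False                      # large files bypass the cache
    if io_kind == "read":
        return bool(read_cache_enable)
    if io_kind == "write":
        return bool(writethrough_cache_enable)
    return False

print(keep_pages_cached("write", 4 << 20))                                  # True
print(keep_pages_cached("write", 2 << 30, readcache_max_filesize=1 << 30))  # False
print(keep_pages_cached("read", 4 << 20, read_cache_enable=0))              # False
```

Either way, the OSS memory is only a staging area for the duration of the I/O submission; it is never a buffer of un-acknowledged dirty data the way the client cache is.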