Sky, Phil,

I've started observing the behavior of Lustre 1.0.4. These observations may be relevant to this discussion.

I have a single OST and a single client.

I observe that the block transfer requests coming from the sockNAL are always 4KB in size, even though the application is doing 1MB writes ("dd if=/dev/zero of=/mnt/lustre/file500M bs=1M count=512"). Similar behavior is observed with reads. My guess is that if O_DIRECT is set, the kernel/Lustre doesn't break the transfer into 4KB chunks but instead uses the chunk size requested by the application (which improves efficiency significantly).

The effective bandwidth of very high speed interconnects like InfiniBand (IB) rises sharply as the transfer size increases: a 4KB message spends only about 4 microseconds on the wire (roughly 1GB/s), whereas the control overhead in Lustre/sockNAL/TCP/IPoIB is much longer than 4 microseconds. The effective bandwidth is therefore limited by the control overhead. I know that if we can get Lustre to issue larger IO requests to an OST, we can significantly increase the observed throughput of the Lustre filesystem over high speed interconnects like InfiniBand.

I've also had some experience measuring IO performance on Linux. Tools like bonnie++ (and especially Iometer for Linux) can report misleading IO performance numbers because they cannot bypass the client's local page cache (if there is an easy way to configure Lustre to always bypass the local cache on the client machine, I'd love to use it). Right now, I'm running my Lustre Lite client on a machine restricted to 256MB of memory (using mem=256M at boot) to reduce the likelihood of pages remaining cached locally (I'm also transferring 500MB files to reduce the likelihood of cache hits).

Duane.

-----Original Message-----
From: Phil Schwan [mailto:phil@clusterfs.com]
Sent: Friday, June 11, 2004 7:14 PM
To: sky
Cc: lustre-discuss@lists.clusterfs.com
Subject: Re: [Lustre-discuss] O_DIRECT

On Fri, 2004-05-28 at 00:06, sky wrote:
> I had performance trouble against IPoIB on IA64. It's very interesting that bonnie with
> the O_DIRECT option achieves a write rate of about 90~100MB/s, but only 20~30MB/s without O_DIRECT.
> We have 2 OSTs with SCSI 320 disks on IA64.
> It seems that data flowing into the page cache is pushed out of the cache slowly.
> I have not read the Lustre source code seriously. I suspect that the IPoIB or InfiniBand stack
> affects Lustre cache operation. It's just my opinion.
> I don't know if the IB NAL under development has experienced the same problem. Is there any way to
> adjust the Lustre cache?

What I find most surprising is that you get such good performance with O_DIRECT at all. bonnie++, at least the way we run it, is a very metadata-intensive test, and it's these metadata portions which are usually the bottleneck.

The difference between O_DIRECT and the page cache is the size and number of RPCs. With O_DIRECT, the API requires that we send the contents of your write() to the server immediately, and not return until it's there. We don't batch it together with other writes, and we can only have 1 RPC in flight at a time per thread.

When writing through the page cache, we take pains to keep many RPCs in flight at all times (4 or 8 by default, depending on your Lustre version), and also to make those RPCs 512k in size. Both of these variables are tunable, for each OSC, in /proc/fs/lustre/osc/.

You could try tuning max_pages_per_rpc down to something smaller, or max_rpcs_in_flight.
Given what was written to the list earlier today about the window sizes in the IPoIB stack, this might just work. Please let us know what you learn!

Thanks--

-Phil

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.clusterfs.com
https://lists.clusterfs.com/mailman/listinfo/lustre-discuss
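To make Phil's O_DIRECT point concrete, here is a minimal user-space sketch of what such a write looks like; this is illustrative C, not Lustre code, and the file name and 1MB size are simply borrowed from Duane's dd test. O_DIRECT requires the buffer (and the file offset and length) to be suitably aligned, and the write() does not return until the data is on the server, which is why a single thread can keep only one RPC in flight:

#define _GNU_SOURCE                 /* O_DIRECT needs this with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 1 << 20;   /* 1MB, matching bs=1M in the dd test */
    void *buf;
    int fd;

    /* O_DIRECT requires an aligned buffer; page alignment is a safe choice */
    if (posix_memalign(&buf, 4096, chunk) != 0)
        return 1;
    memset(buf, 0, chunk);          /* stand-in for /dev/zero data */

    fd = open("/mnt/lustre/file500M", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* with O_DIRECT, this call blocks until the data has reached the
     * server: one synchronous RPC, no batching with other writes */
    if (write(fd, buf, chunk) != (ssize_t)chunk)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}

A buffered write, by contrast, returns as soon as the pages are dirtied, which is what lets the client batch them into 512k RPCs and keep several in flight, up to the max_pages_per_rpc and max_rpcs_in_flight limits Phil mentions.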
Duane,

> -----Original Message-----
> From: lustre-discuss-admin@lists.clusterfs.com
> [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of McCrory, Duane
> Sent: Saturday, June 12, 2004 4:32 AM
> To: Phil Schwan; sky
> Cc: lustre-discuss@lists.clusterfs.com
> Subject: RE: [Lustre-discuss] O_DIRECT
>
> I observe that the block transfer requests coming from the sockNAL are
> always 4KB in size, even though the application is doing 1MB writes
> ("dd if=/dev/zero of=/mnt/lustre/file500M bs=1M count=512"). Similar
> behavior is observed with reads. My guess is that if O_DIRECT is set,
> the kernel/Lustre doesn't break the transfer into 4KB chunks but
> instead uses the chunk size requested by the application (which
> improves efficiency significantly).
> [...]

This is a "feature" of the socknal: it only gives one fragment at a time to the network. However, it _does_ set the 'more' flag on all sends apart from the very last fragment, which allows the Linux TCP/IP stack to aggregate these fragments into full-MTU packets.

I noticed that this was also a problem for the Voltaire SDP implementation, so I produced a patch that passes a full-sized iovec on writes, at the expense of always saving and reconstructing the iovec (partial writes leave the iovec in an unknown state, since there is no guarantee that the caller's iovec will or won't be 'consumed'; we've seen both behaviours!).

Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
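Eric's point about the 'more' flag can be illustrated with a small user-space analogy. The sketch below is not socknal source, and the function names are hypothetical: the first variant hands the network one fragment per send() but sets MSG_MORE on all fragments except the last, so the TCP stack may coalesce them into full-MTU packets; the second passes the whole fragment list in a single call, the approach of Eric's patch. (The iovec-consumption hazard Eric describes is a property of the kernel-internal send path; the user-space sendmsg() shown here leaves the caller's iovec intact.)

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Variant 1: one fragment per send(); MSG_MORE tells the stack that
 * more data follows, so fragments may be aggregated into full-MTU
 * packets.  Short-send/resume handling is omitted for brevity. */
ssize_t send_fragments(int sock, const struct iovec *frags, int nfrags)
{
    ssize_t total = 0;
    int i;

    for (i = 0; i < nfrags; i++) {
        int flags = (i < nfrags - 1) ? MSG_MORE : 0;
        ssize_t n = send(sock, frags[i].iov_base, frags[i].iov_len, flags);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}

/* Variant 2: pass the full fragment list in one call, so the stack
 * sees the whole transfer size up front. */
ssize_t send_whole_iovec(int sock, struct iovec *frags, int nfrags)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = frags;
    msg.msg_iovlen = nfrags;
    return sendmsg(sock, &msg, 0);
}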
hi, everyone

I had performance trouble against IPoIB on IA64. It's very interesting that bonnie with
the O_DIRECT option achieves a write rate of about 90~100MB/s, but only 20~30MB/s without O_DIRECT.
We have 2 OSTs with SCSI 320 disks on IA64.
It seems that data flowing into the page cache is pushed out of the cache slowly.
I have not read the Lustre source code seriously. I suspect that the IPoIB or InfiniBand stack
affects Lustre cache operation. It's just my opinion.
I don't know if the IB NAL under development has experienced the same problem. Is there any way to
adjust the Lustre cache?

kane.