Tanin
2011-May-19 23:29 UTC
[Lustre-discuss] Two questions about the tuning of Lustre file system.
Dear all,

I have two questions regarding the performance of our Lustre file system.

Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. All the nodes (including the MDT/MGS node and the client node) are connected to a Mellanox MTS 3600 InfiniBand switch and use RDMA for data transfer. The bandwidth of the network is 40 Gbps.

Kernel: 'Linux 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux'
OS: RHEL 5.5
Lustre: 1.8.3
OFED: 1.5.2
IB HCA: Mellanox Technologies MT26428 ConnectX VPI PCIe IB QDR

1. I did a simple test on the client side to see the peak read performance. Here is the data:

#time   Data transferred   Bandwidth
2 sec   2.18 GBytes        8.71 Gbits/sec
2 sec   2.06 GBytes        8.24 Gbits/sec
2 sec   2.10 GBytes        8.40 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   1.50 GBytes        6.02 Gbits/sec
2 sec   420.00 MBytes      1.64 Gbits/sec
2 sec   2.19 GBytes        8.75 Gbits/sec
2 sec   2.08 GBytes        8.32 Gbits/sec
2 sec   2.08 GBytes        8.32 Gbits/sec
2 sec   1.99 GBytes        7.97 Gbits/sec
2 sec   1.80 GBytes        7.19 Gbits/sec
*2 sec  160.00 MBytes      640.00 Mbits/sec*
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.13 GBytes        8.52 Gbits/sec
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.09 GBytes        8.36 Gbits/sec
2 sec   2.09 GBytes        8.36 Gbits/sec
2 sec   2.07 GBytes        8.28 Gbits/sec
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.11 GBytes        8.44 Gbits/sec
2 sec   2.05 GBytes        8.20 Gbits/sec
*2 sec  0.00 Bytes         0.00 bits/sec*
*2 sec  0.00 Bytes         0.00 bits/sec*
2 sec   1.95 GBytes        7.81 Gbits/sec
2 sec   2.14 GBytes        8.55 Gbits/sec
2 sec   1.99 GBytes        7.97 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   370.00 MBytes      1.45 Gbits/sec
2 sec   1.96 GBytes        7.85 Gbits/sec
2 sec   2.03 GBytes        8.12 Gbits/sec
2 sec   1.89 GBytes        7.58 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   640.00 MBytes      2.50 Gbits/sec
2 sec   1.47 GBytes        5.90 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   1.90 GBytes        7.62 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   1.18 GBytes        4.73 Gbits/sec
2 sec   940.00 MBytes      3.67 Gbits/sec
2 sec   1.97 GBytes        7.89 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   1.87 GBytes        7.46 Gbits/sec
2 sec   1.77 GBytes        7.07 Gbits/sec
2 sec   320.00 MBytes      1.25 Gbits/sec
2 sec   1.97 GBytes        7.89 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   1.89 GBytes        7.58 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   350.00 MBytes      1.37 Gbits/sec
2 sec   1.77 GBytes        7.07 Gbits/sec
2 sec   1.92 GBytes        7.70 Gbits/sec
2 sec   2.05 GBytes        8.20 Gbits/sec
2 sec   2.01 GBytes        8.05 Gbits/sec
2 sec   710.00 MBytes      2.77 Gbits/sec
2 sec   1.59 GBytes        6.37 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   710.00 MBytes      2.77 Gbits/sec
2 sec   1.59 GBytes        6.37 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   1.88 GBytes        7.54 Gbits/sec
2 sec   1.62 GBytes        6.48 Gbits/sec

As you can see, although the peak bandwidth can reach 8.71 Gbits/sec, the performance is quite unstable (sometimes the bandwidth simply gets choked, as in the starred lines above). All the OSS nodes seem to stop serving data simultaneously. I tried grouping the OSTs differently and turning the checksum on and off, but this still happens. Does anybody have a hint about the cause?

2. As we know, when a Lustre client reads data, the data is moved from the OSS disk into OSS memory and then sent to the client. Apart from O_DIRECT, is there any other configuration that optimizes the disk data access, such as sendfile, splice, or fio (http://freshmeat.net/projects/fio/), which could greatly expedite the disk data access?

Any help will be greatly appreciated. Thanks!

--
Best regards,

Li, Tan
PhD Candidate & Research Assistant,
Electrical Engineering,
Stony Brook University, NY

Personal Web Site: https://sites.google.com/site/homepagelitan/Home
Email: fanqielee at gmail.com
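The buffered-versus-direct comparison asked about in question 2 can be tried straight from the client shell with dd; the mount point and file name below are assumptions about this setup, not taken from the post:

```shell
# Assumed Lustre client mount point -- adjust to the real one.
F=/mnt/lustre/readtest

# Write a file noticeably larger than RAM, then flush it out to the OSTs.
dd if=/dev/zero of=$F bs=1M count=4096 conv=fsync

# Buffered read: goes through the client page cache (and OSS cache),
# so a second run may report cache speed rather than disk/network speed.
dd if=$F of=/dev/null bs=1M

# Direct read: iflag=direct opens the file with O_DIRECT and bypasses
# the client page cache entirely.
dd if=$F of=/dev/null bs=1M iflag=direct

rm -f $F
```

dd reports the achieved throughput on stderr; dropping caches between runs (echo 3 > /proc/sys/vm/drop_caches on client and servers) keeps the buffered numbers honest.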
Kevin Van Maren
2011-May-20 15:23 UTC
[Lustre-discuss] Two questions about the tuning of Lustre file system.
What exactly were you testing? I have no idea how to interpret your numbers. A single client reading from a single file? One file per OST, or a file striped across all OSTs? Is the Lustre file system idle except for your test?

In general, start with the pieces:

1) Make sure the network is sane. Try measuring bandwidth to/from each node (client and server) to ensure all the cables are good. For your configuration, you should be able to measure ~3.2 GB/s (unidirectional) using large MPI messages. While I prefer to use MPI, some people use lnet_selftest.

2) Make sure each OST is sane. For each OST, create a file that is striped only on that OST. Make sure a client can read/write each of these files at the expected rate. Be sure to transfer much more data than the combined client and server RAM sizes.

Many issues are sorted out just by getting both 1 and 2 into good shape.

Kevin

Tanin wrote:
> Dear all,
>
> I have two questions regarding the performance of Lustre System.
> Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. [...]
>
> As you can see, although the peak bandwidth can reach 8.71 Gbits/sec,
> the performance is quite unstable (sometimes the bandwidth just gets
> choked). All the OSS nodes seem to stop reading data simultaneously.
> I tried grouping the OSTs differently and turning the checksum on and
> off, but this still happens. Does anybody have a hint about the cause?
>
> 2. As we know, when a Lustre client reads data, the data is moved from
> the OSS disk into OSS memory and then sent to the client. Apart from
> O_DIRECT, is there any other configuration that optimizes the disk data
> access, such as sendfile, splice, or fio, which could greatly expedite
> the disk data access?
>
> Any help will be greatly appreciated. Thanks!
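Kevin's step 1 can be scripted with lnet_selftest along the lines of the Lustre manual's example; the NIDs below are placeholders, not the cluster's real o2ib addresses:

```shell
# Requires 'modprobe lnet_selftest' on the client and all servers first.
export LST_SESSION=$$
lst new_session read_test

# Placeholder NIDs -- substitute the real client and OSS addresses.
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.[1-5]@o2ib

# 1 MB bulk reads from the servers to the client, mimicking the
# Lustre read path without touching any disks.
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M

lst run bulk_read
lst stat clients servers &   # watch for the same periodic stalls
sleep 60
kill %1
lst end_session
```

If the stalls show up here too, the problem is in LNET/IB rather than the OSTs; if the numbers are steady at line rate, move on to step 2.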
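Step 2 (exercising each OST in isolation) can be sketched with lfs setstripe; the mount point and file sizes are assumptions to adjust for the real client and server RAM:

```shell
MNT=/mnt/lustre          # assumed client mount point
SIZE_MB=32768            # should exceed combined client+server RAM

# 5 OSS x 8 OSTs = 40 OSTs, indices 0-39.
for i in $(seq 0 39); do
    f=$MNT/ost_test.$i
    # -c 1: stripe count of one; -i $i: place the file on OST index $i.
    lfs setstripe -c 1 -i $i $f
    echo "=== OST $i write ==="
    dd if=/dev/zero of=$f bs=1M count=$SIZE_MB conv=fsync
    echo "=== OST $i read ==="
    dd if=$f of=/dev/null bs=1M iflag=direct
    rm -f $f
done
```

Any OST whose numbers stand out (or stall) in this loop points at a specific disk, controller, or OSS rather than the network.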