Tanin
2011-May-19 23:29 UTC
[Lustre-discuss] Two questions about the tuning of Lustre file system.
Dear all,

I have two questions regarding the performance of our Lustre file system.

Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. All the nodes (including the MDT/MGS node and the client node) are connected to a Mellanox MTS 3600 InfiniBand switch and use RDMA for data transfer. The bandwidth of the network is 40 Gbps.

Kernel: 'Linux 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux'
OS: RHEL 5.5
Lustre: 1.8.3
OFED: 1.5.2
IB HCA: Mellanox Technologies MT26428 ConnectX VPI PCIe IB QDR

1. I did a simple test on the client side to see the peak read performance. Here is the data:

#time   Data transferred   Bandwidth
2 sec   2.18 GBytes        8.71 Gbits/sec
2 sec   2.06 GBytes        8.24 Gbits/sec
2 sec   2.10 GBytes        8.40 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   1.50 GBytes        6.02 Gbits/sec
2 sec   420.00 MBytes      1.64 Gbits/sec
2 sec   2.19 GBytes        8.75 Gbits/sec
2 sec   2.08 GBytes        8.32 Gbits/sec
2 sec   2.08 GBytes        8.32 Gbits/sec
2 sec   1.99 GBytes        7.97 Gbits/sec
2 sec   1.80 GBytes        7.19 Gbits/sec
*2 sec  160.00 MBytes      640.00 Mbits/sec*
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.13 GBytes        8.52 Gbits/sec
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.09 GBytes        8.36 Gbits/sec
2 sec   2.09 GBytes        8.36 Gbits/sec
2 sec   2.07 GBytes        8.28 Gbits/sec
2 sec   2.15 GBytes        8.59 Gbits/sec
2 sec   2.11 GBytes        8.44 Gbits/sec
2 sec   2.05 GBytes        8.20 Gbits/sec
*2 sec  0.00 Bytes         0.00 bits/sec*
*2 sec  0.00 Bytes         0.00 bits/sec*
2 sec   1.95 GBytes        7.81 Gbits/sec
2 sec   2.14 GBytes        8.55 Gbits/sec
2 sec   1.99 GBytes        7.97 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   370.00 MBytes      1.45 Gbits/sec
2 sec   1.96 GBytes        7.85 Gbits/sec
2 sec   2.03 GBytes        8.12 Gbits/sec
2 sec   1.89 GBytes        7.58 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   640.00 MBytes      2.50 Gbits/sec
2 sec   1.47 GBytes        5.90 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   1.90 GBytes        7.62 Gbits/sec
2 sec   1.94 GBytes        7.77 Gbits/sec
2 sec   1.18 GBytes        4.73 Gbits/sec
2 sec   940.00 MBytes      3.67 Gbits/sec
2 sec   1.97 GBytes        7.89 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   1.87 GBytes        7.46 Gbits/sec
2 sec   1.77 GBytes        7.07 Gbits/sec
2 sec   320.00 MBytes      1.25 Gbits/sec
2 sec   1.97 GBytes        7.89 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   1.89 GBytes        7.58 Gbits/sec
2 sec   1.93 GBytes        7.73 Gbits/sec
2 sec   350.00 MBytes      1.37 Gbits/sec
2 sec   1.77 GBytes        7.07 Gbits/sec
2 sec   1.92 GBytes        7.70 Gbits/sec
2 sec   2.05 GBytes        8.20 Gbits/sec
2 sec   2.01 GBytes        8.05 Gbits/sec
2 sec   710.00 MBytes      2.77 Gbits/sec
2 sec   1.59 GBytes        6.37 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   710.00 MBytes      2.77 Gbits/sec
2 sec   1.59 GBytes        6.37 Gbits/sec
2 sec   2.00 GBytes        8.01 Gbits/sec
2 sec   1.88 GBytes        7.54 Gbits/sec
2 sec   1.62 GBytes        6.48 Gbits/sec

As you can see, although the peak bandwidth can reach 8.71 Gbits/sec, the performance is quite unstable (sometimes the bandwidth simply gets choked, as in the starred lines above). All the OSS nodes seem to stop serving data simultaneously. I tried grouping the OSTs differently and turning the checksum on and off, but this still happens. Does anybody have a hint about the cause?

2. As we know, when a Lustre client reads data, the data is moved from the OSS disk into OSS memory and then sent to the client. Apart from O_DIRECT, is there any other configuration that optimizes the disk data access, such as sendfile, splice, or fio (http://freshmeat.net/projects/fio/), which could greatly expedite the disk data access?

Any help will be greatly appreciated. Thanks!

--
Best regards,

Li, Tan
PhD Candidate & Research Assistant,
Electrical Engineering,
Stony Brook University, NY

Personal Web Site: https://sites.google.com/site/homepagelitan/Home
Email: fanqielee at gmail.com
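The buffered-versus-direct comparison asked about in question 2 can be tried straight from the client shell with dd; the mount point and file name below are assumptions about this setup, not taken from the post:

```shell
# Assumed Lustre client mount point -- adjust to the real one.
F=/mnt/lustre/readtest

# Write a file noticeably larger than RAM, then flush it out to the OSTs.
dd if=/dev/zero of=$F bs=1M count=4096 conv=fsync

# Buffered read: goes through the client page cache (and OSS cache),
# so a second run may report cache speed rather than disk/network speed.
dd if=$F of=/dev/null bs=1M

# Direct read: iflag=direct opens the file with O_DIRECT and bypasses
# the client page cache entirely.
dd if=$F of=/dev/null bs=1M iflag=direct

rm -f $F
```

dd reports the achieved throughput on stderr; dropping caches between runs (echo 3 > /proc/sys/vm/drop_caches on client and servers) keeps the buffered numbers honest.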
Kevin Van Maren
2011-May-20 15:23 UTC
[Lustre-discuss] Two questions about the tuning of Lustre file system.
What exactly were you testing? I have no idea how to interpret your numbers. A single client reading from a single file? One file per OST, or a file striped across all OSTs? Is the Lustre file system idle except for your test?

In general, start with the pieces:

1) Make sure the network is sane. Try measuring bandwidth to/from each node (client and server) to ensure all the cables are good. For your configuration, you should be able to measure ~3.2 GB/s (unidirectional) using large MPI messages. While I prefer to use MPI, some people use lnet_selftest.

2) Make sure each OST is sane. For each OST, create a file that is striped only on that OST. Make sure a client can read/write each of these files at the expected rate. Be sure to transfer much more data than the combined client and server RAM sizes.

Many issues are sorted out just by getting both 1 and 2 into good shape.

Kevin

Tanin wrote:
> Dear all,
>
> I have two questions regarding the performance of Lustre System.
> Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. [...]
>
> As you can see, although the peak bandwidth can reach 8.71 Gbits/sec,
> the performance is quite unstable (sometimes the bandwidth just gets
> choked). All the OSS nodes seem to stop reading data simultaneously.
> I tried grouping the OSTs differently and turning the checksum on and
> off, but this still happens. Does anybody have a hint about the cause?
>
> 2. As we know, when a Lustre client reads data, the data is moved from
> the OSS disk into OSS memory and then sent to the client. Apart from
> O_DIRECT, is there any other configuration that optimizes the disk data
> access, such as sendfile, splice, or fio, which could greatly expedite
> the disk data access?
>
> Any help will be greatly appreciated. Thanks!
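Kevin's step 1 can be scripted with lnet_selftest along the lines of the Lustre manual's example; the NIDs below are placeholders, not the cluster's real o2ib addresses:

```shell
# Requires 'modprobe lnet_selftest' on the client and all servers first.
export LST_SESSION=$$
lst new_session read_test

# Placeholder NIDs -- substitute the real client and OSS addresses.
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.[1-5]@o2ib

# 1 MB bulk reads from the servers to the client, mimicking the
# Lustre read path without touching any disks.
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M

lst run bulk_read
lst stat clients servers &   # watch for the same periodic stalls
sleep 60
kill %1
lst end_session
```

If the stalls show up here too, the problem is in LNET/IB rather than the OSTs; if the numbers are steady at line rate, move on to step 2.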
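Step 2 (exercising each OST in isolation) can be sketched with lfs setstripe; the mount point and file sizes are assumptions to adjust for the real client and server RAM:

```shell
MNT=/mnt/lustre          # assumed client mount point
SIZE_MB=32768            # should exceed combined client+server RAM

# 5 OSS x 8 OSTs = 40 OSTs, indices 0-39.
for i in $(seq 0 39); do
    f=$MNT/ost_test.$i
    # -c 1: stripe count of one; -i $i: place the file on OST index $i.
    lfs setstripe -c 1 -i $i $f
    echo "=== OST $i write ==="
    dd if=/dev/zero of=$f bs=1M count=$SIZE_MB conv=fsync
    echo "=== OST $i read ==="
    dd if=$f of=/dev/null bs=1M iflag=direct
    rm -f $f
done
```

Any OST whose numbers stand out (or stall) in this loop points at a specific disk, controller, or OSS rather than the network.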