Rob Farber
2006-May-19 07:36 UTC
[Lustre-discuss] Resolved: Disappointing streaming read performance in 1.2.1
Hi All,

Phil Schwan recommended using a 2M stripe_sz, which makes a tremendous
difference in the read performance of 1.2.X. Preliminary results for three
runs using a 2M stripe_sz (4 clients with 4 OSTs on DDN 8500 storage),
compared against our previous results:

1 client:
   67.31 MB/s  2M stripe_sz, Lustre 2.4.20-30.9_lustre.1.2.1_2smp (median of 3 runs, DDN)
    9.62 MB/s  64k stripe_sz, Lustre 2.4.20-30.9_lustre.1.2.1_2smp (20 runs, DDN)
   17.76 MB/s  Lustre 1.2.1: 2.4.20-28.9 (10 runs, DDN 8500)
   42.87 MB/s  v1.0.4 (1 run, DDN 8500)
   40.72 MB/s  v1.0.4 (6 runs, EMC storage, not DDN 8500)
   38.82 MB/s  v1.0 (EMC storage, not DDN 8500)

   Read results:  73.21 MB/s, 66.88 MB/s, 67.31 MB/s
   Write results: 101.83 MB/s, 101.85 MB/s, 101.77 MB/s

4 clients:
  168.55 MB/s  2M stripe_sz, Lustre 1.2.1: 2.4.20-30.9_lustre.1.2.1_2smp (median of 3 runs)
   35.95 MB/s  Lustre 1.2.1: 2.4.20-30.9_lustre.1.2.1_2smp (20 runs)
   36.20 MB/s  64k stripe_sz, Lustre 1.2.1: 2.4.20-28.9 (10 runs, DDN 8500)
  130.46 MB/s  v1.0.4 (1 run, DDN 8500)
  113.79 MB/s  v1.0.4 (6 runs, EMC storage, not DDN 8500)
  127.12 MB/s  v1.0 (EMC storage, not DDN 8500)

   Read results:  169.89 MB/s, 166.98 MB/s, 168.55 MB/s
   Write results: 179.83 MB/s, 177.35 MB/s, 180.12 MB/s

I have had several very useful email conversations with Phil. Here are some
excerpts where he addressed my concerns about the 2M stripe_sz:

RMF: Is 2M a good value for large clusters (1k+ systems) and general
workloads? What are the trade-offs?

It is a good value for all sizes. The documentation that you reference is
out of date -- we have not recommended 64k to our customers (big or small)
for a long time. On our large, 1000+ node clusters, we run with 2MB stripe
sizes.

RMF: What happened with the performance decrease across versions, and does
it tell us anything about the Lustre performance envelope?

The I/O pipelining and batching code has seen considerable improvement in
the months since 1.0.x.
We are better at using multiple threads on the servers, and at keeping more
data in flight to the disk and the network.

The rule of thumb that we use is to make the stripe size at least as large
as the optimal network packet size (experimentally determined to be 512k)
times the number of packets that we keep in flight (4, also experimentally
chosen). 64k is not good because it takes 32 disjoint chunks of the file to
make up that 512k*4 group of network packets. Instead of sending that first
optimally-sized packet after the user app has provided 512k of data, we need
to either wait for roughly 512k*ost_count to be dirtied, or send a packet
which is not optimally sized. Neither outcome is as good as being able to
send a 512k packet every time the user provides 512k of data.

RMF: Considering large numbers of clients and general workloads, is 2M a
*GOOD GENERAL* stripe_sz for 1.2.X?

Yes. That summarizes it pretty well for the moment.

The moral of the story: use a 2M stripe_sz.

Rob
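P.S. For anyone who wants to try this: stripe settings are applied with
lfs setstripe, per file or per directory. The exact syntax varies across
Lustre versions (the 1.x-era form shown here is positional, and the paths
are made up for illustration), so check lfs help on your own installation
before relying on it:

```shell
# Old (1.x-era) positional form: lfs setstripe <file> <size> <start-ost> <count>
# Stripe size is in bytes: 2 MB = 2097152. -1 means "let Lustre choose"
# for the starting OST and "use all available OSTs" for the count.
lfs setstripe /mnt/lustre/testfile 2097152 -1 -1

# Verify what striping the file actually received:
lfs getstripe /mnt/lustre/testfile
```

Setting the striping on a directory makes new files created inside it
inherit the 2M stripe_sz, which is usually more convenient than striping
each file by hand.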
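P.P.S. The arithmetic behind Phil's rule of thumb is easy to check for
yourself. This little script is just my restatement of the numbers he
quotes (512k optimal packet, 4 packets in flight), not anything from the
Lustre source:

```python
# Rule of thumb from the discussion above:
#   stripe_sz >= optimal_packet_size * packets_in_flight
KiB = 1024

optimal_packet = 512 * KiB    # optimal network packet size (experimental)
packets_in_flight = 4         # packets kept in flight (experimental)

recommended_stripe = optimal_packet * packets_in_flight
print("recommended stripe:", recommended_stripe // (1024 * KiB), "MB")

# Why 64k is bad: one 512k*4 group of packets spans this many
# disjoint chunks of the file:
stripe_64k = 64 * KiB
print("64k chunks per group:", recommended_stripe // stripe_64k)
```

Running it reproduces the 2 MB recommendation and the 32 disjoint chunks
mentioned above.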