Rob Farber
2006-May-19 07:36 UTC
[Lustre-discuss] Resolved: Disappointing streaming read performance in 1.2.1
Hi All,

Phil Schwan recommended using a 2M stripe_sz, which makes a tremendous
difference in the read performance of 1.2.X. Preliminary results for three
runs using a 2M stripe_sz (4 clients with 4 OSTs on DDN 8500 storage),
compared against our previous results:

1 client:
   67.31 MB/s  2M stripe_sz, Lustre 2.4.20-30.9_lustre.1.2.1_2smp (median of 3 runs, DDN)
    9.62 MB/s  64k stripe_sz, Lustre 2.4.20-30.9_lustre.1.2.1_2smp (20 runs, DDN)
   17.76 MB/s  Lustre 1.2.1: 2.4.20-28.9 (10 runs, DDN 8500)
   42.87 MB/s  v1.0.4 (1 run, DDN 8500)
   40.72 MB/s  v1.0.4 (6 runs, EMC storage, not DDN 8500)
   38.82 MB/s  v1.0 (EMC storage, not DDN 8500)

   Read results:  73.21 MB/s, 66.88 MB/s, 67.31 MB/s
   Write results: 101.83 MB/s, 101.85 MB/s, 101.77 MB/s

4 clients:
  168.55 MB/s  2M stripe_sz, Lustre 1.2.1: 2.4.20-30.9_lustre.1.2.1_2smp (median of 3 runs)
   35.95 MB/s  Lustre 1.2.1: 2.4.20-30.9_lustre.1.2.1_2smp (20 runs)
   36.20 MB/s  64k stripe_sz, Lustre 1.2.1: 2.4.20-28.9 (10 runs, DDN 8500)
  130.46 MB/s  v1.0.4 (1 run, DDN 8500)
  113.79 MB/s  v1.0.4 (6 runs, EMC storage, not DDN 8500)
  127.12 MB/s  v1.0 (EMC storage, not DDN 8500)

   Read results:  169.89 MB/s, 166.98 MB/s, 168.55 MB/s
   Write results: 179.83 MB/s, 177.35 MB/s, 180.12 MB/s

I have had several very useful email conversations with Phil. Here are some
excerpts where he addressed my concerns about the 2M stripe_sz:

RMF: Is 2M a good value for large clusters (1k+ systems) and general
workloads? What are the trade-offs?

It is a good value for all sizes. The documentation that you reference is
out of date -- we have not recommended 64k to our customers (big or small)
for a long time. On our large, 1000+ node clusters, we run with 2MB stripe
sizes.

RMF: What happened with the performance decrease across versions, and does
it tell us anything about the Lustre performance envelope?

The I/O pipelining and batching code has seen considerable improvement in
the months since 1.0.x.
We are better at using multiple threads on the servers, and at keeping more
data in flight to the disk and the network.

The rule of thumb that we use is to make the stripe size at least as large
as the optimal network packet size (experimentally determined to be 512k)
times the number of packets that we keep in flight (4, also experimentally
chosen). 64k is not good because it takes 32 disjoint chunks of the file to
make up that 512k*4 group of network packets. Instead of sending that first
optimally-sized packet after the user app has provided 512k of data, we need
to either wait for roughly 512k*ost_count to be dirtied, or send a packet
which is not optimally sized. Neither outcome is as good as being able to
send a 512k packet every time the user provides 512k of data.

RMF: Considering large numbers of clients and general workloads, is 2M a
*GOOD GENERAL* stripe_sz for 1.2.X?

Yes. That summarizes it pretty well for the moment.

The moral of the story: use a 2M stripe_sz.

Rob
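P.S. For anyone who wants to try this: stripe settings are applied with
lfs setstripe, per file or per directory. The exact syntax varies across
Lustre versions (the 1.x-era form shown here is positional, and the paths
are made up for illustration), so check lfs help on your own installation
before relying on it:

```shell
# Old (1.x-era) positional form: lfs setstripe <file> <size> <start-ost> <count>
# Stripe size is in bytes: 2 MB = 2097152. -1 means "let Lustre choose"
# for the starting OST and "use all available OSTs" for the count.
lfs setstripe /mnt/lustre/testfile 2097152 -1 -1

# Verify what striping the file actually received:
lfs getstripe /mnt/lustre/testfile
```

Setting the striping on a directory makes new files created inside it
inherit the 2M stripe_sz, which is usually more convenient than striping
each file by hand.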
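P.P.S. The arithmetic behind Phil's rule of thumb is easy to check for
yourself. This little script is just my restatement of the numbers he
quotes (512k optimal packet, 4 packets in flight), not anything from the
Lustre source:

```python
# Rule of thumb from the discussion above:
#   stripe_sz >= optimal_packet_size * packets_in_flight
KiB = 1024

optimal_packet = 512 * KiB    # optimal network packet size (experimental)
packets_in_flight = 4         # packets kept in flight (experimental)

recommended_stripe = optimal_packet * packets_in_flight
print("recommended stripe:", recommended_stripe // (1024 * KiB), "MB")

# Why 64k is bad: one 512k*4 group of packets spans this many
# disjoint chunks of the file:
stripe_64k = 64 * KiB
print("64k chunks per group:", recommended_stripe // stripe_64k)
```

Running it reproduces the 2 MB recommendation and the 32 disjoint chunks
mentioned above.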