Jure Pečar
2011-Jul-14 14:29 UTC
[Lustre-discuss] bursty instead of even write performance
Hello all,

This is my first encounter with a decently sized Lustre storage system. I have a system to play with here, consisting of 10 HP DL320s with 12x750GB drives each, attached to SmartArray controllers. The machines are a few years old and possibly underpowered, equipped with a Xeon 3060 and 2GB of RAM each. Each OSS is capable of about 120MB/s write and 240MB/s read. The MDS is equally weak by today's standards: a DL140 with a Xeon 5110 and only a gig of RAM. The OSSs are connected via bonded GigE to a switch that has a 10GigE connection to other switches, through which the clients connect.

I managed to get Lustre 2.0.58 to show signs of life on this system, using SL6 and about a week of trial-and-error compiling. Mounting it and actually getting it to work was sporadic; about every third attempt produced a working Lustre fs that clients could mount. Since the management tools were mostly useless, I decided to go back to 1.8 and chose 1.8.6-wc, which worked out of the box on CentOS5.

Then I ran some performance tests, and here is my first question. My clients are two racks of IBM machines (84 of them). I'm getting about 700MB/s write and 1.5GB/s read combined, which feels great for my first Lustre. However, I'm noticing some strange patterns in the Ganglia graphs. graph.gif shows the combined write speed while each node was simply writing a large file using dd. Performance slowly drops as the disks get full, something I'm used to. But nodes.gif shows the write speed of each node, and there I see things I'm not used to: long periods of no activity, then sudden bursts, then again nothing. I would expect each client to show steady and even write activity, if only at 8-9 MB/s, but that's not what I see.

So, my question: is what I'm observing an expected situation? Or am I right that I should be seeing more balanced write activity from each node? Since all of the Lustre settings are at their defaults, what should I look into to see if I can tune anything?

Thanks for pointers,

--
Jure Pečar
http://jure.pecar.org

[Attachment: graph.gif (image/gif, 122005 bytes) - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110714/be906df7/attachment-0002.gif]
[Attachment: nodes.gif (image/gif, 33979 bytes) - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110714/be906df7/attachment-0003.gif]
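For reference, the per-node test described above amounts to each client running a dd of roughly this form; the mount point, block size and file size are assumptions, since the exact parameters are not given in the post:

  # each client writes its own large file to the Lustre mount
  dd if=/dev/zero of=/mnt/lustre/$(hostname).dat bs=1M count=20000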
Peter Grandi
2011-Jul-25 13:25 UTC
[Lustre-discuss] bursty instead of even write performance
> I have a system to play with here, consisting of 10 HP DL320
> with 12x750GB drives each, attached to SmartArray controllers.

That's good because it is a lot of disk arms (quite a few people seem to underestimate IOPS and go for fewer, larger drives). Those are reasonable systems for IO, especially if they are configured with a non-demented RAID layout (low chances of that, of course :->).

> Machines are few years old and possibly underpowered, equipped
> with xeon 3060 and 2GB of RAM each. Each OSS is capable of
> about 120MB/s write and 240MB/s read.

If that's 12x750GB drives doing 120MB/s of aggregate writes, ah well, the "non-demented RAID layout" is indeed a forlorn hope.

> MDS is equally weak by today's standards, a DL140 with xeon
> 5110 and only a gig of RAM.

Not necessarily: that's a newer-style machine (there seems to be a really large difference in IO performance between pre-PCIe chipsets and PCIe ones), as the 5110 implies it is a G4, and hence a PCIe-class machine. The 1GiB is too small, but what really matters is a high-IOPS storage system behind the MDS, and you don't say what that is like. Perhaps you could use one of the DL320s, or a slice of one (or two, so you get a backup MDS).

> OSSs are connected via bonded GigE to a switch

Bonding is often a bad idea, depending on which type of bonding; sometimes two independent interfaces are better.

> that has 10GigE connection to other switches, through which
> clients connect.

This seems to mean that the server switch has multiple 10Gb/s connections, else the aggregate read rate of 1.5GB/s is hard to explain. In the write case you need to look carefully at the switch setup and at the Linux network setup on the servers, as you have a situation where incoming traffic on those 10Gb/s links gets distributed to a set of 20x 1Gb/s ports with lower aggregate capacity and lower individual speeds.

> My clients are two racks of IBM machines (84 of them). I'm
> getting about 700MB/s write and 1.5GB/s read combined, [ ... ]

That's curious, because Lustre usually does better at writing than reading in simple benchmarks, especially for concurrent read/write. Perhaps the OST RAID setup could be revised :-), and the switch and Linux network setups reviewed.

Your numbers come from 120 drives and 20 Ethernet interfaces across the servers, for a total of around 70-80TB of capacity and 2-2.2GB/s of network transfer rate, and while the measured per-drive and per-interface numbers do not seem high, the key word here is "combined": it means there is potentially a significant rate of seeking, so overall they seem reasonable.

BTW, I assume that "combined" here means "concurrent", as in able to do 700MB/s of writes and 1500MB/s of reads *at the same time*, and not "aggregate", as in summed across the 84 clients while only reading or only writing.

> graph.gif shows combined write speed when each node was simply
> writing a large file using dd. Performance slowly drops as the
> disks get full, something I'm used to.

It is remarkable that you have profiled this, as I have noticed that many people (e.g. GSI some years ago :->) seem surprised that outer-vs-inner-track speed differences (and fragmentation) mean that near-full disks are way slower than near-empty disks. In your case the speed should perhaps drop a lot more; typical disks are 2x slower on inner tracks than on outer tracks, so the modest drop probably means that something is preventing you from taking full advantage of the speed of the outer tracks.
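A quick way to see that outer-vs-inner-track difference on one of those OSSs is to read a chunk from the start and from near the end of a raw block device. A sketch, where the device name is a guess (on SmartArray/cciss controllers the logical drive usually shows up as something like /dev/cciss/c0d0 rather than /dev/sda) and the skip value needs adjusting to land near the end of whatever device you test:

  # size of the device in bytes, to choose where "near the end" is
  blockdev --getsize64 /dev/cciss/c0d0

  # read 1GB from the outer tracks (start of the device); reads only,
  # never point of= at the device
  dd if=/dev/cciss/c0d0 of=/dev/null bs=1M count=1000 iflag=direct

  # read 1GB from the inner tracks; skip is in bs-sized (1MiB) units,
  # so pick a value a little below the device size in MiB
  dd if=/dev/cciss/c0d0 of=/dev/null bs=1M count=1000 skip=700000 iflag=direct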
But the 700MB/s write speed here seems "aggregate" rather than "concurrent", as there are negligible reads driving up IOPS and the writes are purely sequential, and that rate is somewhat disappointing, as it implies only around 35MB/s per 1Gb/s interface and under 6MB/s per drive.

> But nodes.gif shows write speed of each node, which shows
> things I'm not used to - long periods of no activity, then
> sudden bursts, then again nothing.

Why? Writing is heavily subject to buffering, both at the Linux level and at the Lustre level, and flushes happen rarely, often with pretty bad consequences. With the default Linux etc. setup it happens pretty often that some GB of "dirty" pages accumulate in memory and then get "burped" out in one go, congesting the storage system.

> I would assume each client to have a steady and even write
> activity, if only at 8-9 MB/s, but that's not what I see.

Each OSS with 12x 750GB disks (each of which is capable of average transfer rates of around 50MB/s) should be doing a few hundred MB/s. Ahh, I now realize that 'nodes.gif' is the transfer rate of each *client* node, not each *server* (OSS) node. In that case, maximum transfer rates of 20MB/s (and much lower averages) when writing to 120 server disks, i.e. roughly 1-1.5 server disks per client node, are not that awesome.

> So, my question: Is what I'm observing an expected situation?

That depends on what you are observing, as you are not clear about what you are measuring. For example, you don't state clearly where the 'dd' is running and what its parameters (e.g. 'bs', 'iflag' and 'oflag') are. Presumably it is running on multiple client nodes, otherwise you would not be getting 500-700MB/s aggregate (unless a client node had a 10Gb/s interface), and the "nodes" you refer to seem to be the client nodes (there are more than 10 graphs).

Maybe it would be interesting to measure, with something like this:

  dd bs=1M count=10000 if=/dev/zero conv=fdatasync of=/lus/TEST10G

the speed of one OST mounted as 'ldiskfs' locally on the OSS, for both the optimal-case write as above and the corresponding read, to check the upper bounds. Then try the same on a single client with a 1Gb/s interface, ideally also a single client with a 10Gb/s interface, then 10 clients with 1Gb/s interfaces (same number of clients as OSSes), then 20 clients with 1Gb/s interfaces (same number of clients as total server interfaces), and then 40 clients with 1Gb/s interfaces (more clients than server interfaces or servers, and 3 server disks per client, which should be able to deliver close to 1Gb/s between them).

> Or am I right that I should be seeing more balanced write
> activity from each node?

Well, that depends on how much write buffering you configured, explicitly or implicitly, in the Linux flusher and in Lustre. But the big deal is not that you are getting bursty IO; it is that the numbers involved are not that awesome for 120 disks and 20 network interfaces across the servers. The 20 network interfaces mean that each server can't do more than 200-220MB/s in/out, but since the 12 disks per server should get you a lot more than that, you should be getting close to full utilization of those network interfaces.
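On the Linux-flusher side of that buffering, the relevant knobs are the vm.dirty_* sysctls. A sketch of how to inspect them, and of the kind of change people experiment with (the values below are illustrative, not recommendations):

  # how much dirty / under-writeback data the client currently holds
  grep -E '^(Dirty|Writeback):' /proc/meminfo

  # flusher thresholds (percent of RAM) and timers (centiseconds)
  sysctl vm.dirty_background_ratio vm.dirty_ratio \
         vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

  # example: start background writeback earlier, so data trickles out
  # instead of being burped out in multi-GB lumps
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=10

On a Lustre client the per-OST dirty limit (osc.*.max_dirty_mb, see below) also caps how much can accumulate, so both sides matter.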
Perhaps, given that each OSS is anyhow limited to 200-220MB/s by its 2 network interfaces, you could reconfigure your storage system to take advantage of that, go for a lower local peak transfer rate :-), and aim at lower latency and higher IOPS, as you have many clients (but that's perhaps a different discussion).

> Since all of the lustre settings are at their defaults, what
> should I look into to see if I can tune anything?

There are a few tuning guides with various settings, and discussions in this mailing list, in particular as to RPCs. I just did a web search with:

  +lustre +rpc write buffering rates OR speed OR performance

and got several relevant hits, e.g.:

  http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/lustre-hpc-technical%20bulletin-dell-cambridge-03022011.pdf
  http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf

In particular the Dell UK HPC people are doing valuable work with Lustre, and their findings are more generally applicable than their kit (which BTW I quite like).
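On a 1.8 client, the parameters those discussions keep coming back to can be inspected and changed with lctl. A sketch; the values in the set_param lines are examples to experiment with, not recommendations, and they do not persist across a remount:

  # per-OST client write cache and RPC concurrency
  lctl get_param osc.*.max_dirty_mb
  lctl get_param osc.*.max_rpcs_in_flight

  # histograms of pages per RPC and RPCs in flight, handy for spotting
  # small or bursty RPC traffic
  lctl get_param osc.*.rpc_stats

  # example changes to experiment with
  lctl set_param osc.*.max_rpcs_in_flight=32
  lctl set_param osc.*.max_dirty_mb=128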