Theodore Omtzigt
2011-Jul-15 20:30 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
I got a basic Lustre cluster up and running and did two experiments:

1. using GbE as the interconnect
2. using QDR IB as the interconnect

Here are the simple performance results I collected using the pointers from the Lustre user guide:

[root@22-82 ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 06/29/11 OST speed survey on /mnt/lustre01 from 172.21.22.82 at tcp
Number of Active OST devices : 8
Worst  Read OST indx: 1 speed: 57.704295
Best   Read OST indx: 3 speed: 61.655312
Read Average: 59.785920 +/- 1.245626 MB/s
Worst  Write OST indx: 3 speed: 28.564328
Best   Write OST indx: 5 speed: 70.976016
Write Average: 55.497721 +/- 13.457404 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     59.931      65.722       0.501      0.456
1     57.704      42.436       0.520      0.707
2     60.056      66.074       0.500      0.454
3     61.655      28.564       0.487      1.050
4     58.096      62.979       0.516      0.476
5     59.935      70.976       0.501      0.423
6     61.053      57.724       0.491      0.520
7     59.856      49.507       0.501      0.606

[root@22-82_ib ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 07/14/11 OST speed survey on /mnt/lustre01 from 10.1.3.82 at o2ib
Number of Active OST devices : 8
Worst  Read OST indx: 0 speed: 180.625987
Best   Read OST indx: 6 speed: 214.961331
Read Average: 200.478485 +/- 11.408814 MB/s
Worst  Write OST indx: 0 speed: 291.709350
Best   Write OST indx: 6 speed: 496.616135
Write Average: 397.025375 +/- 59.815286 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     180.626     291.709      0.166      0.103
1     206.211     396.815      0.145      0.076
2     207.928     356.645      0.144      0.084
3     197.543     384.335      0.152      0.078
4     206.908     403.361      0.145      0.074
5     205.670     470.235      0.146      0.064
6     214.961     496.616      0.140      0.060
7     183.981     376.487      0.163      0.080

Are these results any good? To me it looks very disappointing, as we can get 3GB/s from the RAID controller aggregating a collection of raw SAS drives on the OSTs, and we should be able to get a peak of ~5GB/s from QDR IB.

First question: is this baseline reasonable?
Second question: what are the tools I can use to better understand Lustre FS behavior and characterize the performance I am getting on the client side?

I did check the IB network and did not record any IB network errors during these runs, so I am confident that the IB network was working properly.

Looking forward to better understanding Lustre performance.
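[For reference, an IB error check like the one described can be run with the standard infiniband-diags tools; a sketch, assuming those tools are installed (not necessarily the ones used above):]

```shell
# One way to check for IB link errors, using the infiniband-diags
# package (assumed available on a node attached to the fabric).
# Scan the fabric and report any ports with nonzero error counters:
ibqueryerrors
# Query the performance/error counters of the local HCA port:
perfquery
```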
Tim Carlson
2011-Jul-18 13:28 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
On Fri, 15 Jul 2011, Theodore Omtzigt wrote:

> To me it looks very disappointing as we can get 3GB/s from the RAID
> controller aggregating a collection of raw SAS drives on the OSTs, and
> we should be able to get a peak of ~5GB/s from QDR IB.
>
> First question: is this baseline reasonable?

For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving real data. 40Gb/s is the signaling rate, and you need to factor in the 8b/10b encoding. So your 40Gb/s becomes 32Gb/s right off the bat. Now try to move some data with something like MPI_Send and you will see that the real amount of data you can send is really more like 24Gb/s, or 3GB/s.

The test size for ost-survey is pretty small: 30MB. You can increase that with the "-s" flag. Try at least 100MB.

You should also turn off checksums to test raw performance. There is an lctl conf_param to do this, but the quick and dirty route on the client is the following bash:

    for OST in /proc/fs/lustre/osc/*/checksums
    do
        echo 0 > $OST
    done

For comparison's sake, on my latest QDR-connected Lustre file system with LSI 9285-8e controllers connected to JBODs of slow disks in 11-disk RAID 6 stripes, I get around 500MB/s write and 350MB/s read using ost-survey with 100MB data chunks.

Your numbers seem reasonable.

Tim

--
-------------------------------------------
Tim Carlson, PhD
Senior Research Scientist
Environmental Molecular Sciences Laboratory
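[Tim's back-of-the-envelope numbers can be reproduced with a few lines of shell arithmetic. A sketch; the ~75% protocol-efficiency factor is an assumption chosen to match his observed ~24Gb/s MPI figure, not a measured constant:]

```shell
# QDR 4x InfiniBand signaling rate, in Gb/s
SIGNAL_GBPS=40
# 8b/10b link encoding: 8 payload bits for every 10 line bits
DATA_GBPS=$((SIGNAL_GBPS * 8 / 10))   # 32 Gb/s after encoding
DATA_GBYTES=$((DATA_GBPS / 8))        # 4 GB/s theoretical payload rate
# Rough protocol efficiency seen at the MPI level (~75%, an assumption
# matching the ~24 Gb/s figure quoted above)
REAL_GBPS=$((DATA_GBPS * 3 / 4))      # ~24 Gb/s, i.e. ~3 GB/s
echo "encoded=${DATA_GBPS}Gb/s payload=${DATA_GBYTES}GB/s realistic=${REAL_GBPS}Gb/s"
```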
Kevin Van Maren
2011-Jul-18 14:47 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
Tim Carlson wrote:

> On Fri, 15 Jul 2011, Theodore Omtzigt wrote:
>
>> To me it looks very disappointing as we can get 3GB/s from the RAID
>> controller aggregating a collection of raw SAS drives on the OSTs, and
>> we should be able to get a peak of ~5GB/s from QDR IB.
>>
>> First question: is this baseline reasonable?
>
> For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving
> real data. 40Gb/s is the signaling rate and you need to factor in the
> 8b/10b encoding. So your 40Gb/s becomes 32Gb/s right off the bat.

Yes, the (unidirectional) bandwidth of QDR 4x IB is 4GB/s, including headers, due to the InfiniBand 8b/10b encoding. This is the same (raw) data rate as PCIe gen2 x8 (which also uses 8b/10b encoding, transmitting 10 bits for every 8-bit byte). Interestingly, the upcoming InfiniBand "FDR" moves to 64b/66b encoding, which eliminates most of the link overhead.

[8b/10b encoding exists 1) to ensure an equal number of 1 and 0 bits, and 2) to set an upper bound on the number of sequential 1 or 0 bits at a small number. With 64b/66b there can now be something like 65 bits in a row with the same value, which makes it more susceptible to clock skew issues, although the claim is that in practice the number of bits is much smaller, as a scrambler is used to "randomize" the actual bits, and the sequences that correspond to 64 1's or 64 0's will "never" be used. So the "wrong" data pattern could cause more problems.]

To clarify, this 4GB/s is reduced to around 3.2GB/s of data primarily due to the smaller packet size of PCIe (256 bytes), where the headers consume quite a bit of the BW, or somewhat less when using 128-byte PCIe packets. While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet get that high. As I recall, something ~2.5GB/s is more typical.

> Now try to move some data with something like MPI_Send and you will see
> that the real amount of data you can send is really more like 24Gb/s,
> or 3GB/s.
>
> The test size for ost-survey is pretty small: 30MB. You can increase
> that with the "-s" flag. Try at least 100MB.
>
> You should also turn off checksums to test raw performance. There is an
> lctl conf_param to do this, but the quick and dirty route on the client
> is the following bash:
>
>     for OST in /proc/fs/lustre/osc/*/checksums
>     do
>         echo 0 > $OST
>     done
>
> For comparison's sake, on my latest QDR-connected Lustre file system
> with LSI 9285-8e controllers connected to JBODs of slow disks in
> 11-disk RAID 6 stripes, I get around 500MB/s write and 350MB/s read
> using ost-survey with 100MB data chunks.
>
> Your numbers seem reasonable.
>
> Tim

Theodore,

You have jumped straight to testing Lustre over the network, without first providing performance numbers for the disks when locally attached. (You also didn't test the network, but in the absence of bad links GigE and IB are less variable and well understood.)

As for the disk performance, were you able to measure 3GB/s from the RAID controller, or what is that number based on? What was the performance of an individual LUN (or whatever backs your OST)? Are all the OSTs on a single server, and are you testing them one at a time?

You should be able to get 100+MB/s over GigE, although you may need 2 OSTs to do that, and larger IO sizes. Similarly, if you access multiple OSTs simultaneously, you should be able to get more than 2GB/s over o2ib. At least I am assuming you are using o2ib and not just tcp over InfiniBand, which would be slower.

Kevin
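[A minimal sketch of the local-disk baseline Kevin is asking for, using plain dd. The path and size are placeholders: on real hardware, point TARGET at the OST's backing device or a file on it, use multi-GB transfers, and note that writing to a raw LUN is destructive:]

```shell
# Hypothetical scratch path; substitute the OST's backing storage for a
# real baseline (writing directly to a raw LUN destroys its data!).
TARGET=/tmp/disk_baseline.dat
SIZE_MB=64            # use several GB on real hardware to defeat caches
# Sequential write; conv=fsync flushes data to media before dd exits.
# dd reports the achieved MB/s on stderr when it completes.
dd if=/dev/zero of="$TARGET" bs=1M count="$SIZE_MB" conv=fsync
WRITTEN=$(stat -c %s "$TARGET")
# Sequential read back; on a real run, drop the page cache first with
#   echo 3 > /proc/sys/vm/drop_caches
dd if="$TARGET" of=/dev/null bs=1M
rm -f "$TARGET"
```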
Christopher J. Morrone
2011-Aug-02 02:06 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
On 07/18/2011 07:47 AM, Kevin Van Maren wrote:

> While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet
> get that high. As I recall, something ~2.5GB/s is more typical.

I saw ~3GB/s with lnet-selftest on QDR quite recently. One run was reporting ~3.1GB/s.

Chris
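[For anyone wanting to reproduce such a measurement, a bulk-write run follows the standard lnet-selftest lst workflow, roughly as in this sketch. The NIDs, duration, and sizes are placeholders; the lnet_selftest module must be loaded on every participating node:]

```shell
# Load the selftest module on each participating node first:
#   modprobe lnet_selftest
export LST_SESSION=$$           # session ID used by the lst commands
lst new_session rw_baseline
# Placeholder NIDs for one client and one server on the o2ib network.
lst add_group clients 10.1.3.82@o2ib
lst add_group servers 10.1.3.1@o2ib
lst add_batch bulk_rw
# 1MB bulk writes from clients to servers; raise concurrency for load.
lst add_test --batch bulk_rw --concurrency 8 --from clients --to servers \
    brw write size=1M
lst run bulk_rw
sleep 30                        # let it run, then sample throughput
lst stat clients servers
lst end_session
```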