Theodore Omtzigt
2011-Jul-15 20:30 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
I got a basic Lustre cluster up and running and did two experiments:

1. using GbE as the interconnect
2. using QDR IB as the interconnect

Here are the simple performance results I collected using the pointers from the Lustre user guide:

[root@22-82 ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 06/29/11 OST speed survey on /mnt/lustre01 from 172.21.22.82 at tcp
Number of Active OST devices : 8
Worst  Read OST indx: 1 speed: 57.704295
Best   Read OST indx: 3 speed: 61.655312
Read Average: 59.785920 +/- 1.245626 MB/s
Worst  Write OST indx: 3 speed: 28.564328
Best   Write OST indx: 5 speed: 70.976016
Write Average: 55.497721 +/- 13.457404 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     59.931      65.722       0.501      0.456
1     57.704      42.436       0.520      0.707
2     60.056      66.074       0.500      0.454
3     61.655      28.564       0.487      1.050
4     58.096      62.979       0.516      0.476
5     59.935      70.976       0.501      0.423
6     61.053      57.724       0.491      0.520
7     59.856      49.507       0.501      0.606

[root@22-82_ib ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 07/14/11 OST speed survey on /mnt/lustre01 from 10.1.3.82 at o2ib
Number of Active OST devices : 8
Worst  Read OST indx: 0 speed: 180.625987
Best   Read OST indx: 6 speed: 214.961331
Read Average: 200.478485 +/- 11.408814 MB/s
Worst  Write OST indx: 0 speed: 291.709350
Best   Write OST indx: 6 speed: 496.616135
Write Average: 397.025375 +/- 59.815286 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     180.626     291.709      0.166      0.103
1     206.211     396.815      0.145      0.076
2     207.928     356.645      0.144      0.084
3     197.543     384.335      0.152      0.078
4     206.908     403.361      0.145      0.074
5     205.670     470.235      0.146      0.064
6     214.961     496.616      0.140      0.060
7     183.981     376.487      0.163      0.080

Are these results any good? To me it looks very disappointing, as we can get 3GB/s from the RAID controller aggregating a collection of raw SAS drives on the OSTs, and we should be able to get a peak of ~5GB/s from QDR IB.

First question: is this baseline reasonable?
Second question: what are the tools I can use to better understand Lustre FS behavior and characterize the performance I am getting on the client side?

I did check the IB network and did not record any IB network errors during these runs, so I am confident that the IB network was working properly.

Looking forward to better understanding Lustre performance.
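[For reference, an IB error check like the one described can be run with the standard infiniband-diags tools; a sketch, assuming those tools are installed (not necessarily the ones used above):]

```shell
# One way to check for IB link errors, using the infiniband-diags
# package (assumed available on a node attached to the fabric).
# Scan the fabric and report any ports with nonzero error counters:
ibqueryerrors
# Query the performance/error counters of the local HCA port:
perfquery
```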
Tim Carlson
2011-Jul-18 13:28 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
On Fri, 15 Jul 2011, Theodore Omtzigt wrote:

> To me it looks very disappointing as we can get 3GB/s from the RAID
> controller aggregating a collection of raw SAS drives on the OSTs, and
> we should be able to get a peak of ~5GB/s from QDR IB.
>
> First question: is this baseline reasonable?

For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving real data. 40Gb/s is the signaling rate, and you need to factor in the 8b/10b encoding. So your 40Gb/s becomes 32Gb/s right off the bat. Now try to move some data with something like MPI_Send and you will see that the real amount of data you can send is really more like 24Gb/s, or 3GB/s.

The test size for ost-survey is pretty small: 30MB. You can increase that with the "-s" flag. Try at least 100MB.

You should also turn off checksums to test raw performance. There is an lctl conf_param to do this, but the quick and dirty route on the client is the following bash:

    for OST in /proc/fs/lustre/osc/*/checksums
    do
        echo 0 > $OST
    done

For comparison's sake, on my latest QDR-connected Lustre file system with LSI 9285-8e controllers connected to JBODs of slow disks in 11-disk RAID 6 stripes, I get around 500MB/s write and 350MB/s read using ost-survey with 100MB data chunks.

Your numbers seem reasonable.

Tim

--
-------------------------------------------
Tim Carlson, PhD
Senior Research Scientist
Environmental Molecular Sciences Laboratory
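[Tim's back-of-the-envelope numbers can be reproduced with a few lines of shell arithmetic. A sketch; the ~75% protocol-efficiency factor is an assumption chosen to match his observed ~24Gb/s MPI figure, not a measured constant:]

```shell
# QDR 4x InfiniBand signaling rate, in Gb/s
SIGNAL_GBPS=40
# 8b/10b link encoding: 8 payload bits for every 10 line bits
DATA_GBPS=$((SIGNAL_GBPS * 8 / 10))   # 32 Gb/s after encoding
DATA_GBYTES=$((DATA_GBPS / 8))        # 4 GB/s theoretical payload rate
# Rough protocol efficiency seen at the MPI level (~75%, an assumption
# matching the ~24 Gb/s figure quoted above)
REAL_GBPS=$((DATA_GBPS * 3 / 4))      # ~24 Gb/s, i.e. ~3 GB/s
echo "encoded=${DATA_GBPS}Gb/s payload=${DATA_GBYTES}GB/s realistic=${REAL_GBPS}Gb/s"
```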
Kevin Van Maren
2011-Jul-18 14:47 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
Tim Carlson wrote:

> On Fri, 15 Jul 2011, Theodore Omtzigt wrote:
>
>> To me it looks very disappointing as we can get 3GB/s from the RAID
>> controller aggregating a collection of raw SAS drives on the OSTs, and
>> we should be able to get a peak of ~5GB/s from QDR IB.
>>
>> First question: is this baseline reasonable?
>
> For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving
> real data. 40Gb/s is the signaling rate and you need to factor in the
> 8b/10b encoding. So your 40Gb/s becomes 32Gb/s right off the bat.

Yes, the (unidirectional) bandwidth of QDR 4x IB is 4GB/s, including headers, due to the InfiniBand 8b/10b encoding. This is the same (raw) data rate as PCIe gen2 x8 (which also uses 8b/10b encoding, transmitting 10 bits for every 8-bit byte). Interestingly, the upcoming InfiniBand "FDR" moves to 64b/66b encoding, which eliminates most of the link overhead.

[8b/10b encoding exists 1) to ensure an equal number of 1 and 0 bits, and 2) to set an upper bound on the number of sequential 1 or 0 bits at a small number. With 64b/66b there can now be something like 65 bits in a row with the same value, which makes it more susceptible to clock skew issues, although the claim is that in practice the number of bits is much smaller, as a scrambler is used to "randomize" the actual bits, and the sequences that correspond to 64 1's or 64 0's will "never" be used. So the "wrong" data pattern could cause more problems.]

To clarify, this 4GB/s is reduced to around 3.2GB/s of data primarily due to the smaller packet size of PCIe (256 bytes), where the headers consume quite a bit of the BW, or somewhat less when using 128-byte PCIe packets. While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet get that high. As I recall, something ~2.5GB/s is more typical.

> Now try to move some data with something like MPI_Send and you will see
> that the real amount of data you can send is really more like 24Gb/s,
> or 3GB/s.
>
> The test size for ost-survey is pretty small: 30MB. You can increase
> that with the "-s" flag. Try at least 100MB.
>
> You should also turn off checksums to test raw performance. There is an
> lctl conf_param to do this, but the quick and dirty route on the client
> is the following bash:
>
>     for OST in /proc/fs/lustre/osc/*/checksums
>     do
>         echo 0 > $OST
>     done
>
> For comparison's sake, on my latest QDR-connected Lustre file system
> with LSI 9285-8e controllers connected to JBODs of slow disks in
> 11-disk RAID 6 stripes, I get around 500MB/s write and 350MB/s read
> using ost-survey with 100MB data chunks.
>
> Your numbers seem reasonable.
>
> Tim

Theodore,

You have jumped straight to testing Lustre over the network, without first providing performance numbers for the disks when locally attached. (You also didn't test the network, but in the absence of bad links GigE and IB are less variable and well understood.)

As for the disk performance, were you able to measure 3GB/s from the RAID controller, or what is that number based on? What was the performance of an individual LUN (or whatever backs your OST)? Are all the OSTs on a single server, and are you testing them one at a time?

You should be able to get 100+MB/s over GigE, although you may need 2 OSTs to do that, and larger IO sizes. Similarly, if you access multiple OSTs simultaneously, you should be able to get more than 2GB/s over o2ib. At least I am assuming you are using o2ib and not just tcp over InfiniBand, which would be slower.

Kevin
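[A minimal sketch of the local-disk baseline Kevin is asking for, using plain dd. The path and size are placeholders: on real hardware, point TARGET at the OST's backing device or a file on it, use multi-GB transfers, and note that writing to a raw LUN is destructive:]

```shell
# Hypothetical scratch path; substitute the OST's backing storage for a
# real baseline (writing directly to a raw LUN destroys its data!).
TARGET=/tmp/disk_baseline.dat
SIZE_MB=64            # use several GB on real hardware to defeat caches
# Sequential write; conv=fsync flushes data to media before dd exits.
# dd reports the achieved MB/s on stderr when it completes.
dd if=/dev/zero of="$TARGET" bs=1M count="$SIZE_MB" conv=fsync
WRITTEN=$(stat -c %s "$TARGET")
# Sequential read back; on a real run, drop the page cache first with
#   echo 3 > /proc/sys/vm/drop_caches
dd if="$TARGET" of=/dev/null bs=1M
rm -f "$TARGET"
```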
Christopher J. Morrone
2011-Aug-02 02:06 UTC
[Lustre-discuss] how to baseline the performance of a Lustre cluster?
On 07/18/2011 07:47 AM, Kevin Van Maren wrote:

> While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet
> get that high. As I recall, something ~2.5GB/s is more typical.

I saw ~3GB/s with lnet-selftest on QDR quite recently. One run was reporting ~3.1GB/s.

Chris
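[For anyone wanting to reproduce such a measurement, a bulk-write run follows the standard lnet-selftest lst workflow, roughly as in this sketch. The NIDs, duration, and sizes are placeholders; the lnet_selftest module must be loaded on every participating node:]

```shell
# Load the selftest module on each participating node first:
#   modprobe lnet_selftest
export LST_SESSION=$$           # session ID used by the lst commands
lst new_session rw_baseline
# Placeholder NIDs for one client and one server on the o2ib network.
lst add_group clients 10.1.3.82@o2ib
lst add_group servers 10.1.3.1@o2ib
lst add_batch bulk_rw
# 1MB bulk writes from clients to servers; raise concurrency for load.
lst add_test --batch bulk_rw --concurrency 8 --from clients --to servers \
    brw write size=1M
lst run bulk_rw
sleep 30                        # let it run, then sample throughput
lst stat clients servers
lst end_session
```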