I've put together a small test Lustre system which is giving confusing
(at least to me) results. All nodes are running fully patched 64-bit
RHEL 5.3 with the premade Lustre 1.8.1 x86_64 RPMs.

The nodes are a bit cobbled together from what I had handy.

One MDS:  8-core 2.5 GHz Nehalem, 8GB RAM, single E1000 gigabit NIC.
          MDT is just a partition on a 1TB SAS Seagate.
Two OSS:  Dual-core 2.8GHz Xeon, 4GB RAM, single E1000 gigabit NIC.
          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA drives.
          Two OSTs per OSS of 2TB each, configured as LVM; 1 and 4MB
          stripe sizes tried.
Client:   8-core 2.5 GHz Xeon, 8GB RAM, single Broadcom gigabit NIC.
Network:  Dedicated Cisco 2960G gigabit switch.

This gives 2 OSSes and 4 OSTs of 2TB each, for a total of 8TB. I've tried
1MB and 4MB stripes.

Using Bonnie++ 1.03b (-f -s24g) from the client I see decent numbers when
reading/writing to any single OST (94 and 112 MB/s write/read). I see
slightly better numbers using 2 OSTs on the same OSS (98 and 115 MB/s
write/read).

When I use any 2 OSTs across two OSSes, or all 4 OSTs, I see a distinct
fall-off in read rates. In that case I get full 115MB/s writes but only
40MB/s reads. This holds true for striping that uses any combination of
OSTs spanning both OSSes.

All the data rates are about what I'd expect given the subsystems and
gigabit ethernet, but those very slow reads confuse me. I expected
slightly slower reads (say 80-90 MB/s) due to buffer issues, but not 40.

With iostat I see relatively sustained read rates on each OST's volume,
as opposed to full reads, wait, full reads, wait, which seems to imply
the client is the one setting the pace. But I'm confused why the client
is so slow reassembling two streams from two OSSes and not two streams
from one OSS.

I've tried 1MB and 4MB stripe sizes, I've tried increasing the RX ring on
the OSSes to 4096, and I've tried disabling checksums. Not surprisingly,
nothing had any effect, since each OSS can easily handle the client
requests on its own. I have *not* applied the patches that address the
potential corruption issue in 1.8.x; I saw no evidence they applied in
this case.

I've searched through this list but haven't seen anything that seems
equivalent. I feel I must have missed something simple on the client side
but am at my wits' end what that is.

Thanks in advance for any insight as to what I'm missing.

James Robnett
NRAO/NM
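[For context, the invocation behind these numbers would look roughly like
the sketch below. The mount point and directory are made up; the only
Bonnie++ options actually mentioned above are -f and -s.]

    # Hypothetical client mount point; 24 GB test size so the file is well
    # beyond the client's 8 GB of RAM, -f skips the slow per-character tests.
    mkdir -p /mnt/lustre/bonnie
    bonnie++ -d /mnt/lustre/bonnie -s 24g -f
    # (add -u <user> if running as root; bonnie++ refuses to run as root
    #  without it)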
After reading through my first post I felt some clarification was
probably warranted.

In this test setup there are two OSSes, call them OSS-1 and OSS-2;
each has two OSTs, call them OSS-1-A, OSS-1-B and OSS-2-A, OSS-2-B.

The MDS, OSSes and client all have 1Gbit ethernet connections.

The following table shows the data rates I see in MB/s.

   OST(s)                               Read    Write
   OSS-1-A                               113     95
   OSS-1-B                               112     93
   OSS-1-A OSS-1-B                       112     98
   OSS-2-A                               105     93
   OSS-2-B                               115     94
   OSS-2-A OSS-2-B                       115     98
   OSS-1-B OSS-2-A                  --->  42     113
   OSS-1-A OSS-2-B                  --->  42     114
   OSS-1-A OSS-1-B OSS-2-A OSS-2-B  --->  46     114

The write numbers are almost exactly what I'd expect across 1Gbit:
96MB/s or so between the client and a single OSS, and nearly full rate
(112MB/s) with two OSSes. The 113MB/s read numbers for a single OSS (one
or more OSTs) are also pretty much exactly what I'd expect.

It's the 40MB/s reads when utilizing 2 OSSes that are throwing me. I can
envision that there would be more re-assembly overhead on the client in
the case of 2 OSSes(1), but I'm surprised it's that high.

Is this an expected result?

If it's unexpected, is there a common misconfiguration or client
shortcoming that causes it to be slower when reading from multiple
OSSes?

Is there some command I could run or data I could provide that would
help identify the issue? I'm fairly new to Lustre so I'm just as likely
to flood noise as signal if I just randomly appended data beyond raw
rates.

I just upgraded to 1.8.1.1, which had no effect.

James Robnett
NRAO/NM

1) I'm assuming in the case of a single OSS with 2 OSTs the OSS presents
the client with a single stream. If assembly of two data streams is
required on the client in both the single and dual OSS (both with 2
OSTs) cases then I'm even more confused about those results.

James Robnett wrote:
> The nodes are a bit cobbled together from what I had handy.
>
> One MDS: Dual quad-core 2.5GHz Nehalem, 8GB RAM, E1000 gigabit NIC
>          MDT is just a partition on a 1TB SAS Seagate
> Two OSS: Single dual-core 2.8GHz Xeon, 4GB RAM, single gigabit NIC
>          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA
>          drives.
>          Two OST/OSS: 2TB. Configured as LVM. 1 and 4MB stripe size tried.
> Client:  Dual quad-core 2.5 GHz Xeon, 8GB RAM, single gigabit NIC
> Network: Dedicated Cisco 2960G gigabit switch
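[The per-combination tests in the table can be reproduced by pinning the
stripe layout of a directory before writing the test file. A sketch only,
assuming OST indices 0-3 and the Lustre 1.8 option syntax for lfs
setstripe; with the 1.8 tools you pick a starting index and a stripe
count rather than an arbitrary OST list, so which OSS pairs you can hit
depends on how the indices map onto the OSSes.]

    # Single OST (index 0), 4 MB stripe size:
    lfs setstripe -s 4M -i 0 -c 1 /mnt/lustre/ost0-only

    # Two OSTs starting at index 1; this spans both OSSes if indices 0,1
    # live on one OSS and 2,3 on the other:
    lfs setstripe -s 4M -i 1 -c 2 /mnt/lustre/two-oss

    # Confirm which OST objects a test file actually landed on:
    lfs getstripe /mnt/lustre/two-oss/testfile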
On 14-Oct-09, at 14:15, James Robnett wrote:
> After reading through my first post I felt some clarification was
> probably warranted.
>
> In this test setup there are two OSSes, call them OSS-1 and OSS-2;
> each has two OSTs, call them OSS-1-A, OSS-1-B and OSS-2-A, OSS-2-B.
>
> The MDS, OSSes and client all have 1Gbit ethernet connections.
>
> The following table shows the data rates I see in MB/s.
>
>    OST(s)                               Read    Write
>    OSS-1-A                               113     95
>    OSS-1-B                               112     93
>    OSS-1-A OSS-1-B                       112     98
>    OSS-2-A                               105     93
>    OSS-2-B                               115     94
>    OSS-2-A OSS-2-B                       115     98
>    OSS-1-B OSS-2-A                  --->  42     113
>    OSS-1-A OSS-2-B                  --->  42     114
>    OSS-1-A OSS-1-B OSS-2-A OSS-2-B  --->  46     114

You're sure that there isn't some other strange effect here, like you
are only measuring the speed of a single iozone thread or similar?

> I can envision that there would be more re-assembly overhead on
> the client in the case of 2 OSSes(1) but I'm surprised it's that high.
>
> Is this an expected result ?
>
> If it's unexpected is there a common misconfiguration or client
> shortcoming that causes it to be slower when reading from multiple
> OSSes?

This is definitely NOT expected, and I'm puzzled as to why this might
be.

> Is there some command I could run or data I could provide that would
> help identify the issue ? I'm fairly new to Lustre so I'm just as
> likely to flood noise as signal if I just randomly appended data
> beyond raw rates.

You could check /proc/fs/lustre/obdfilter/*/brw_stats on the respective
OSTs to see if the client is not assembling the RPCs very well for some
reason.

Alternately, it might be that you have configured the disk storage of
OSS-1 and OSS-2 to compete (e.g. different partitions sharing the same
disks).

> 1) I'm assuming in the case of a single OSS with 2 OSTs the OSS
> presents the client with a single stream. If assembly of two data
> streams is required on the client in both the single and dual OSS
> (both with 2 OSTs) cases then I'm even more confused about those
> results.

No, the client needs to assemble the OST objects itself, regardless of
whether the OSTs are on the same OSS or not. The file should be striped
over all of the OSTs involved in the test.

> James Robnett wrote:
>> The nodes are a bit cobbled together from what I had handy.
>>
>> One MDS: Dual quad-core 2.5GHz Nehalem, 8GB RAM, E1000 gigabit NIC
>>          MDT is just a partition on a 1TB SAS Seagate
>> Two OSS: Single dual-core 2.8GHz Xeon, 4GB RAM, single gigabit NIC
>>          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA
>>          drives.
>>          Two OST/OSS: 2TB. Configured as LVM. 1 and 4MB stripe size tried.
>> Client:  Dual quad-core 2.5 GHz Xeon, 8GB RAM, single gigabit NIC
>> Network: Dedicated Cisco 2960G gigabit switch

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
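[Concretely, the checks being suggested here might look like the
following on a 1.8 system. The /proc paths are the usual ones but should
be treated as assumptions for this particular install.]

    # On each OSS: per-OST bulk-RPC histograms (ideally reads arrive as
    # full 256-page / 1 MB RPCs):
    cat /proc/fs/lustre/obdfilter/*/brw_stats

    # On the client: per-OSC RPC statistics, showing how many pages the
    # client packs into each read RPC and how many RPCs are in flight:
    cat /proc/fs/lustre/osc/*/rpc_stats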
Many thanks for the reply Andreas.

> You're sure that there isn't some other strange effect here, like you
> are only measuring the speed of a single iozone thread or similar?

I'm just looking at the output from Bonnie++ running on the client. I
see corresponding numbers when examining iostat on each OST; the sum of
the iostats from each OST in use matches the Bonnie++ numbers. Can
Bonnie be at fault? I've only been setting the test size. I'll try
iozone to see if it returns similar results.

> This is definitely NOT expected, and I'm puzzled as to why this might
> be.

Considering how 'stock' this should be, i.e. RHEL 5.3 with Sun-provided
RPMs, I must be doing something wrong or more folks would see it, but
I'm dipped if I know what it is. Everything works, no errors, just slow
for multiple OSSes.

> You could check /proc/fs/lustre/obdfilter/*/brw_stats on the
> respective OSTs to see if the client is not assembling the RPCs very
> well for some reason.

I ran two instances of bonnie++: the first used OST0000 and OST0001 on
OSS1, the second used OST0001 on OSS1 and OST0002 on OSS2. I rebooted
between each run to reset the stats.

The contents of /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats look
essentially identical in both runs, even though the read rate in the
first was 114MB/s and in the second 38MB/s. I've appended the read
portion of both files below. I'm not sure exactly what I should be
looking for in those stats.

I'm also curious how it could be the OST at fault, since 2 OSTs on one
OSS give the expected ~115MB/s read rate but 2 OSTs on two OSSes give
~40MB/s.

> Alternately, it might be that you have configured the disk storage of
> OSS-1 and OSS-2 to compete (e.g. different partitions sharing the same
> disks).

Each OSS has two internal PCI 8-port 3ware 9550SX cards and 16 internal
disks carved into two separate 7+1 RAID 5 groups (one per card). They're
physically distinct where disk storage is concerned.

> No, the client needs to assemble the OST objects itself, regardless of
> whether the OSTs are on the same OSS or not. The file should be striped
> over all of the OSTs involved in the test.

Iostat on each OST confirms the striping. I see and don't see reads on
OSTs where I'd expect as I change the striping. OSTs not in use are
quiescent. OSTs in use show uniform read rates between them, and they
have relatively constant rates per second. No starvation apparent.

It sure seems like some issue on the client not being able to deal with
multiple streams from multiple OSSes, while it can deal just fine with
multiple streams from a single OSS.

I've tried to think of some way the switch could be at fault but haven't
come up with anything. It's a Cisco 2960 gigabit switch, and while it
can block, it shouldn't be in this case. I have no problem obtaining
115MB/s reads and writes as long as I avoid reading across two OSSes.

Again, many thanks for the reply. If nothing else, knowing it really is
wrong will make me keep digging. If you can think of any output I could
show or test I could do to help isolate the problem, I'm all ears.

James Robnett
NRAO/NM

Below is the read portion of brw_stats for OST0001 from the 40MB/s run
(left) and the 115MB/s run (right); I removed the write portion for
clarity.
                           read (40MB/s)      |   read (115MB/s)
pages per bulk r/w       rpcs   %  cum %      |   rpcs   %  cum %
1:                       5003  17    17       |   5256  18    18
2:                         13   0    17       |     23   0    18
4:                         11   0    17       |      1   0    18
8:                         19   0    17       |      1   0    18
16:                        14   0    17       |     11   0    18
32:                        53   0    17       |     18   0    18
64:                        47   0    17       |     11   0    18
128:                       74   0    17       |     35   0    18
256:                    24145  82   100       |  23415  81   100

                           read (40MB/s)      |   read (115MB/s)
discontiguous pages      rpcs   %  cum %      |   rpcs   %  cum %
0:                      29261  99    99       |  28735  99    99
1:                         61   0    99       |     34   0    99
2:                         18   0    99       |      2   0   100
3:                         15   0    99       |      0   0   100
4:                          9   0    99       |      0   0   100
5:                          7   0    99       |      0   0   100
6:                          4   0    99       |      0   0   100
7:                          3   0    99       |      0   0   100
8:                          0   0    99       |      0   0   100
9:                          1   0   100       |      0   0   100
10:                         0   0   100       |      0   0   100
11:                         0   0   100       |      0   0   100
12:                         0   0   100       |      0   0   100
13:                         0   0   100       |

                           read (40MB/s)      |   read (115MB/s)
discontiguous blocks     rpcs   %  cum %      |   rpcs   %  cum %
0:                      29261  99    99       |  28735  99    99
1:                         61   0    99       |     34   0    99
2:                         18   0    99       |      2   0   100
3:                         15   0    99       |      0   0   100
4:                          9   0    99       |      0   0   100
5:                          7   0    99       |      0   0   100
6:                          4   0    99       |      0   0   100
7:                          3   0    99       |      0   0   100
8:                          0   0    99       |      0   0   100
9:                          1   0   100       |      0   0   100
10:                         0   0   100       |      0   0   100
11:                         0   0   100       |      0   0   100
12:                         0   0   100       |      0   0   100
13:                         0   0   100       |

                           read (40MB/s)      |   read (115MB/s)
disk fragmented I/Os      ios   %  cum %      |    ios   %  cum %
0:                          1   0     0       |   5308  18    18
1:                       5084  17    17       |     12   0    18
2:                         44   0    17       |     18   0    18
3:                         46   0    17       |     17   0    18
4:                         38   0    17       |     10   0    18
5:                         31   0    17       |     20   0    18
6:                         30   0    17       |     12   0    18
7:                         29   0    18       |  23353  81    99
8:                      24034  81    99       |     21   0   100
9:                         27   0    99       |      0   0   100
10:                         8   0    99       |      0   0   100
11:                         3   0    99       |      0   0   100
12:                         3   0    99       |      0   0   100
13:                         0   0    99       |
14:                         1   0   100       |

                           read (40MB/s)      |   read (115MB/s)
disk I/Os in flight       ios   %  cum %      |    ios   %  cum %
1:                      15990   8     8       |  14821   7     7
2:                      16817   8    16       |  16105   8    16
3:                      15968   8    24       |  14930   7    23
4:                      15761   7    32       |  14260   7    31
5:                      16390   8    40       |  14644   7    38
6:                      17131   8    49       |  15039   7    46
7:                      17786   8    58       |  15383   7    54
8:                      18551   9    67       |  15887   8    62
9:                       7313   3    71       |   7218   3    66
10:                      7100   3    74       |   7006   3    70
11:                      6755   3    78       |   6824   3    73
12:                      6416   3    81       |   6738   3    77
13:                      5931   2    84       |   6438   3    80
14:                      5386   2    87       |   6209   3    83
15:                      4831   2    89       |   5983   3    86
16:                      4287   2    91       |   5540   2    89
17:                      2146   1    92       |   2314   1    90
18:                      1928   0    93       |   2213   1    92
19:                      1703   0    94       |   2046   1    93
20:                      1531   0    95       |   1911   0    94
21:                      1376   0    96       |   1772   0    95
22:                      1202   0    96       |   1602   0    95
23:                      1011   0    97       |   1398   0    96
24:                       749   0    97       |   1190   0    97
25:                       435   0    97       |    640   0    97
26:                       383   0    98       |    584   0    97
27:                       358   0    98       |    526   0    98
28:                       328   0    98       |    477   0    98
29:                       298   0    98       |    434   0    98
30:                       258   0    98       |    365   0    98
31:                      2559   1   100       |   2224   1   100

                           read (40MB/s)      |   read (115MB/s)
I/O time (1/1000s)        ios   %  cum %      |    ios   %  cum %
1:                       1079   3     3       |    339   1     1
2:                       5565  18    22       |   3228  11    12
4:                       5672  19    41       |   6847  23    36
8:                       2649   9    50       |   4393  15    51
16:                      5967  20    71       |   8461  29    80
32:                      7243  24    95       |   4243  14    95
64:                      1073   3    99       |   1176   4    99
128:                      126   0    99       |     84   0   100
256:                        5   0   100       |      0   0   100
512:                        0   0   100       |      0   0   100

                           read (40MB/s)      |   read (115MB/s)
disk I/O size             ios   %  cum %      |    ios   %  cum %
4K:                      5147   2     2       |   5263   2     2
8K:                        94   0     2       |     28   0     2
16K:                       18   0     2       |     11   0     2
32K:                       45   0     2       |     20   0     2
64K:                       98   0     2       |     48   0     2
128K:                  193276  97   100       | 187351  97   100
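[For repeating this kind of before/after comparison without rebooting,
something like the following should work. The assumption, based on the
usual Lustre /proc convention, is that writing to the stats files zeroes
the counters; nothing here is specific to this install.]

    # On each OSS, clear the bulk-RPC histograms before a run:
    for f in /proc/fs/lustre/obdfilter/*/brw_stats; do echo 0 > $f; done

    # Watch per-device disk throughput on the OSS during the run:
    iostat -x 2

    # Afterwards, save the histogram for later comparison:
    cat /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats > /tmp/brw_stats.run1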
Hi,

When are we going to see the new updated Lustre 1.8.1.1 support matrix?
http://wiki.lustre.org/index.php/Lustre_Support_Matrix only lists 1.8.1.

Another question: does Lustre support servers running the latest RHEL 5.3
with 1.8.1.1 but clients only running RHEL 5.1? Can the client also run
1.8.1.1? Can this configuration run OFED 1.4.1 or 1.4.2?

TIA
The support matrix has been updated for 1.8.1.1:
http://wiki.lustre.org/index.php/Lustre_Release_Information

Dr. Hung-Sheng Tsao wrote:
> Hi,
>
> When are we going to see the new updated Lustre 1.8.1.1 support matrix?
> http://wiki.lustre.org/index.php/Lustre_Support_Matrix only lists 1.8.1.
>
> Another question: does Lustre support servers running the latest RHEL 5.3
> with 1.8.1.1 but clients only running RHEL 5.1? Can the client also run
> 1.8.1.1? Can this configuration run OFED 1.4.1 or 1.4.2?
>
> TIA
A bit more (likely not useful) info on this odd read performance problem
across multiple OSSes.

We have a different prototype Lustre installation in another group. The
hardware is a bit different, but it is also a RHEL 5.3-based system
installed from RPMs (v1.8). That system also sends traffic through a
Cisco Catalyst 2960 48-port gigabit switch. We see the same odd
performance issue on that Lustre system: reads limited to a single OSS
are client-network limited, while reads that involve more than one OSS
are capped around 40MB/s.

Considering nobody else has reported this, it seems like it must be
either some oddity in our base RHEL 5.3 install or some oddity in that
Cisco switch. The switch has a 32Gbit/s backplane and otherwise seems
perfectly capable of handling traffic from multiple clients at much
faster rates than we're seeing. Nor do I see any evidence on the OSSes
or switch that the switch becomes congested trying to direct multiple
read streams to the client's 1Gbit interface. It's quite possible I'm
looking for the wrong thing there; suggestions welcome. I'm more
inclined to believe it's our base OS install.

It continues to appear that the experienced read rate is driven by the
request rate from the client and not by some reply bottleneck. The
client simply isn't requesting reads at full speed if multiple OSSes
are involved.

There's some evidence that this other Lustre system used to get more
typical read/write rates. That suggests some subsequent RHEL 5.3 patch
is affecting the performance.

BTW, I've now tried bonnie++, iozone and IOR; all give similar results,
so that rules out some bonnie++ pathology.

James Robnett
NRAO/NM

8 OSTs across all 4 OSSes:

[root at casa-dev-13 C]# lfs getstripe /lustre/casa-store/IOR/
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
/lustre/casa-store/IOR/
stripe_count: -1 stripe_size: 4194304 stripe_offset: 0

Run began: Tue Oct 20 10:39:22 2009
Command line used: ./IOR -r -w -b 64K -t 64K -s 270000 -N 1 -o /lustre/casa-store/IOR/junk
Machine: Linux casa-dev-13

Summary:
        api                 = POSIX
        test filename       = /lustre/casa-store/IOR/junk
        access              = single-shared-file
        ordering in a file  = sequential offsets
        ordering inter file = no tasks offsets
        clients             = 1 (1 per node)
        repetitions         = 1
        xfersize            = 65536 bytes
        blocksize           = 65536 bytes
        aggregate filesize  = 16.48 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev   Mean (s)
Op grep    #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------   --------
write        116.01     116.01      116.01     0.00       0.01       0.01        0.01     0.00  145.46395  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL
read          44.29      44.29       44.29     0.00       0.00       0.00        0.00     0.00  381.01059  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL

Max Write: 116.01 MiB/sec (121.64 MB/sec)
Max Read:   44.29 MiB/sec (46.44 MB/sec)

2 OSTs (0000 and 0001 on OSS-1, confirmed via iostat on OSS-1):
lfs getstripe /lustre/casa-store/IOR-OST-0-1/
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
/lustre/casa-store/IOR-OST-0-1/
stripe_count: 2 stripe_size: 4194304 stripe_offset: 0

Run began: Tue Oct 20 10:31:32 2009
Command line used: ./IOR -r -w -b 64K -t 64K -s 270000 -N 1 -o /lustre/casa-store/IOR-OST-0-1/junk
Machine: Linux casa-dev-13

Summary:
        api                 = POSIX
        test filename       = /lustre/casa-store/IOR-OST-0-1/junk
        access              = single-shared-file
        ordering in a file  = sequential offsets
        ordering inter file = no tasks offsets
        clients             = 1 (1 per node)
        repetitions         = 1
        xfersize            = 65536 bytes
        blocksize           = 65536 bytes
        aggregate filesize  = 16.48 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev   Mean (s)
Op grep    #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------   --------
write        118.45     118.45      118.45     0.00       0.01       0.01        0.01     0.00  142.46609  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL
read         117.64     117.64      117.64     0.00       0.01       0.01        0.01     0.00  143.44616  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL

Max Write: 118.45 MiB/sec (124.20 MB/sec)
Max Read:  117.64 MiB/sec (123.35 MB/sec)
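[Since the symptoms point at the client pacing the read requests, the
client-side readahead and RPC-concurrency tunables are worth checking
alongside these IOR runs. A sketch only; the parameter names follow the
usual Lustre 1.8 layout and should be verified against the installed
version.]

    # Per-OSC concurrent RPC limit (default 8) and client readahead limits:
    lctl get_param osc.*.max_rpcs_in_flight
    lctl get_param llite.*.max_read_ahead_mb
    lctl get_param llite.*.max_read_ahead_per_file_mb

    # Example of raising them temporarily for a test run:
    lctl set_param osc.*.max_rpcs_in_flight=32
    lctl set_param llite.*.max_read_ahead_mb=256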
James Robnett wrote:
> BTW, I've now tried bonnie++, iozone and IOR; all give similar results,
> so that rules out some bonnie++ pathology.

FWIW, I tried plain old dd on one of the Lustre filesystems (striping
across OSTs on multiple OSSes), and the results were similar.

-- Martin
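[For completeness, the kind of dd check referred to here might look like
the following; the file path is illustrative, and the cache drop is just
to make sure the read actually goes over the wire rather than being
served from client memory.]

    # Drop the client page cache first:
    echo 3 > /proc/sys/vm/drop_caches

    # Sequential read of an existing striped file; dd reports the rate:
    dd if=/lustre/casa-store/IOR/junk of=/dev/null bs=1M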
The problem appears to be network congestion control on the OSSes,
triggered by these Cisco 2960 switches' inability to deal with
over-subscription very well. The problem even occurs if the client has a
channel-bonded 2 x 1Gbit interface pair and only two OSSes are involved.
Sadly it was that result that led me to believe the problem was on the
client and not the switch or the OSSes.

I connected the OSSes and client to an el cheapo Allied Telesyn 8-port
1Gbit switch. A client with a single 1Gbit interface and a test with two
OSSes resulted in 116MB/s writes and reads. A second test involving 4
OSSes (each with two OSTs) reverted to 116MB/s writes and 40-ish MB/s
reads, which implies the AT switch is better but there's still a problem.

Looking at the OSSes I discovered some sub-optimal IP stack settings, in
particular:

   net.ipv4.tcp_sack = 0
   net.ipv4.tcp_timestamps = 0

Setting those both to *1* improved the AT switch case to about 78MB/s
reads across 4 OSSes, but that switch doesn't support a 9000 MTU.

Fixing up the OSSes with those IP settings and returning to the original
switch (which does support a 9000 MTU) seems to be the best case:

   Across 4 OSSes w/8 OSTs, 4MB stripe size, 9000 MTU:
      115MB/s writes, 106MB/s reads
   Across 4 OSSes w/8 OSTs, 1MB stripe size, 9000 MTU:
      115MB/s writes, 111MB/s reads

So for now I'd say it's all better, though I'll be suspicious of our
settings till I see a scaled-up version running on a newer switch with
full throughput.

I did some "site:lists.lustre.org <string>" type searches for congestion
and tcp sysctl and came up with very little. Are there best-practice TCP
settings for Lustre in 1Gbit, channel-bonded environments (as opposed to
IB or 10G)? We have our own set here we've empirically settled on.

James Robnett

Here are all the changes we make beyond stock settings:

   net.ipv4.tcp_tw_recycle = 1
   net.ipv4.tcp_fin_timeout = 10
   net.core.rmem_max = 16777216
   net.core.wmem_max = 16777216
   net.ipv4.tcp_rmem = 4096 87380 16777216
   net.ipv4.tcp_wmem = 4096 65536 16777216
   net.ipv4.tcp_no_metrics_save = 1
   net.core.netdev_max_backlog = 3000
   # Added for Lustre
   net.ipv4.tcp_sack = 1
   net.ipv4.tcp_timestamps = 1
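[A small sketch of how these settings would typically be applied and made
persistent on RHEL 5 nodes, with the jumbo-frame change included; the
interface name eth0 is a placeholder.]

    # Apply the two settings that made the difference at runtime:
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_timestamps=1

    # Persist them (together with the buffer tuning above) by adding the
    # lines to /etc/sysctl.conf, then reload:
    sysctl -p

    # Jumbo frames on the OSS and client data interfaces (the switch must
    # also be configured for a 9000-byte MTU):
    ifconfig eth0 mtu 9000
    # To make the MTU persistent on RHEL, add MTU=9000 to
    # /etc/sysconfig/network-scripts/ifcfg-eth0.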