Hi Eric,

I would love to trade notes with you through all this. The environment is:

- Latest Celeron 2.4GHz, 512MB RAM, 80GB Western Digital @ 7200rpm
- Red Hat 9
- Intel 1000 MT running the stock driver from the RH9 install
- Cheap 8-port D-Link gigabit switch (DGS-1008D)
- Lustre 1.0 rpm release

I run OpenSSI providing a two-node cluster as the client, and 2 OSSs with roughly a 70GB partition each, giving a 140GB Lustre partition in total. As a test, the MDS is running on an older Pentium III 733MHz with 256MB RAM on a 100Mbit NIC.

What I've noticed so far:

- Tuning the network performance according to SPECweb99 doesn't help.
- Tuning the network for a gigabit architecture according to many websites does not help.
- Tweaking the NIC's interrupt throttle rates, delays and buffers does not help.

All of this suggests that the network is currently not the bottleneck for a 2-node OSS setup, whether running a 64k, 512k, 1MB or 2MB LOV stripe.

What I DID stumble across late last night is that if, on the client, I mount, dismount, then mount again, I get good, flat performance across the spectrum using iozone. However, if I leave it for a while after the test and rerun it, I fall back into that familiar big dip in performance when the transfer block size hits 8192. Perhaps this is an anomaly already fixed in Lustre 1.2.x, but I'm not privy to that at the moment.

I've played with some of the proc files under /proc/fs/lustre/osc/OSC*/ (max_rpcs_in_flight, etc.), but they don't appear to make much of a difference yet.

I've also tested performance with bonnie++, and the difference between a Lustre mount and the raw drive is really in the random reads/writes and in the random and sequential file creation/deletion. This is where you MUST have a fast MDS and a fast NIC on the MDS, because the random reads/writes and file creation/deletion all make use of the metadata store on the MDS machine.

What have your experiences been? I'm curious to know what your progress has been so far.

Cheers,
Cuong.

On Wed, 22 Sep 2004 23:04:08 +0100, Eric Barton <eeb@bartonsoftware.com> wrote:
> Cuong,
>
> How wonderful. I'll plot the numbers with great interest.
>
> Can you tell me the architecture/network/os etc stuff?
>
> Have you got raw device performance graphs?
>
> We're working on smoothing out performance and beginning to get
> somewhere. Your work can only help.
>
> Cheers,
> Eric
>
> ---------------------------------------------------
> |Eric Barton       Barton Software                 |
> |9 York Gardens    Tel:    +44 (117) 330 1575      |
> |Clifton           Mobile: +44 (7909) 680 356      |
> |Bristol BS8 4LL   Fax:    call first              |
> |United Kingdom    E-Mail: eeb@bartonsoftware.com  |
> ---------------------------------------------------
>
> > -----Original Message-----
> > From: lustre-discuss-admin@lists.clusterfs.com
> > [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of smee
> > Sent: Wednesday, September 22, 2004 5:15 AM
> > To: lustre-discuss@lists.clusterfs.com
> > Subject: [Lustre-discuss] performance tuning
> >
> > [...]
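The bonnie++ comparison described above can be reproduced with an invocation along these lines. This is only a sketch: the mount point, file size and user are placeholders, not the values actually used in these tests.

# Run bonnie++ against the Lustre mount (placeholders, not the original invocation)
# -d: directory to test in
# -s: file size in MB for the streaming tests (larger than RAM to limit cache effects)
# -n: number of small files, in multiples of 1024, for the create/stat/delete phases
# -u: user to run as when started as root
bonnie++ -d /mnt/lustre -s 1024 -n 16 -u nobody

The same invocation pointed at a local directory on an OSS gives the raw-drive baseline for comparison.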
I noticed that lustre.pdf makes mention of lmc --mountfsoptions <option>. Is that applicable to Lustre 1.2.x only? I'm using 1.0 and it gives me an error about unrecognised --mountfsoptions. I was trying to see if I could pass it options like "noatime,nodirtime".

Also, the same docs mention an lmc --add net --tcpbuf <size> option where the default size is 1MB, but when mounting the respective machines (mds, ost, client) in verbose mode, the debug output shows that send_mem and recv_mem are 8388608. So again, are the docs describing 1.2.x, or are they incorrect for this option?

Cheers,
Cuong

On Thu, 23 Sep 2004 11:54:09 +1000, smee <snotmee@gmail.com> wrote:
> Hi Eric,
>
> I would love to trade notes with you through all this.
> [...]
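One way to put those send_mem/recv_mem figures in context is to compare them with the kernel's generic socket-buffer settings. The paths below are the standard Linux 2.4 sysctls, not Lustre-specific tunables, so this is only a rough cross-check.

# Per-socket buffer maxima and defaults, in bytes
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.core.rmem_default net.core.wmem_default
# TCP buffer ranges: min, default, max (bytes)
cat /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem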
Hello Cuong--

On 9/23/2004 2:15, "smee" <snotmee@gmail.com> wrote:

> I noticed the lustre.pdf makes mention of lmc --mountfsoptions <option>.
> Is that applicable to Lustre 1.2.x only? I'm using 1.0 and it gives me
> an error about unrecognised --mountfsoptions.
> I was trying to see if I could pass it options like "noatime,nodirtime".

This option is supported by lmc in 1.2.x.

> Also, the same docs mention an lmc --add net --tcpbuf <size> option
> where the default size is 1MB, but when mounting the respective
> machines (mds, ost, client) in verbose mode, the debug output shows
> that send_mem and recv_mem are 8388608. So again, are the docs
> describing 1.2.x, or are they incorrect for this option?

This option is not yet supported.

It's important to realize that lustre.pdf often describes what we want, and not what is exactly implemented. There are several sections which describe features which are not yet built, or which are not included in our production releases.

-Phil
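For anyone already on 1.2.x who wants to try the option Phil mentions, the MDS line of a config script might look roughly like the following. This is a hypothetical fragment: the config file, node, label and device names are invented, only --mountfsoptions itself comes from the discussion above, and the exact syntax should be checked against the 1.2.x documentation.

# Hypothetical 1.2.x lmc fragment; config.xml, mds1 and /dev/hda3 are placeholders,
# and the usual --add node/net/lov/ost/mtpt lines of a full config script are omitted.
# Note: the ext3 mount option is spelled "nodiratime" (not "nodirtime").
lmc -m config.xml --add mds --node mds1 --mds mds1 --fstype ext3 \
    --dev /dev/hda3 --mountfsoptions "noatime,nodiratime"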
Hi,

I've got an MDS and several OSSs set up properly and am going through getting an idea of how to performance-tune things.

I've used iozone directly on each OSS itself, then compared that against iozone run on a Lustre mount. I'm getting poor performance from the Lustre mount compared to the raw readings from the drive itself. Even if I ignore the absolute figures, looking at the graphs plotted from the iozone readings, the graphs for writes look very different. On the raw drive graph you can see a uniform surface across the write-blocksize spectrum, whereas on the Lustre mount there are large valleys and troughs. This tells me that either the network is poorly tuned, or the Lustre component is poorly tuned, or a combination of both. When I say poorly tuned, I mean for my scenario, hardware-wise and environment-wise.

Sample of iozone output for the raw drive (file size in KB down the left, record size in KB across the top):

     KB       4       8      16      32      64     128     256     512    1024    2048    4096    8192   16384
     64   87066   90152   93167   85438   79692
    128   86192   98307  115324   95234   90582   83877
    256   90296   94847  103646   99259   96021   90811   84318
    512   90268   98840  102095   99668   92152   90762   91395   87268
   1024   90957   98775  100907   99242   92964   92044   91429   93133   91756
   2048   89859   98537  101552   99071   94121   92052   93251   92439   93192   91310
   4096   89362   99239  100857   98206   93614   92964   92571   92743   92739   92774   91310
   8192   88024   97479  101247   98256   92986   91263   91393   85121   92203   92530   92671   91435
  16384   89353   97054   99498   97740   92662   90705   91244   91147   90493   91031   91213   91273   90831
  32768   86754   92916   98786   96950   91569   89815   90547   90980   90949   92074   92071   92312   91819
  65536   89665   96852   99777   97932   92776   91595   91472   92044   91083   92173   92707   92505   92220
 131072   89967   97431   99183   98332   93049   91462   91087   92142   91978   92585   92550   92607   92601
 262144   86324   91915   96636   92946   88862   86880   87787   88801   89097   88852   88873   88662   89203

Sample of iozone output for the Lustre mount (same layout):

     KB       4       8      16      32      64     128     256     512    1024    2048    4096    8192   16384
     64   25974   26478   26891   27622   27245
    128   46060   47075   49325   49942   48339   47547
    256   63193   69698   72560   69699   72563   74943   69377
    512   89604   91904  101649  104959   97466   95150  100020   58274
   1024  108634  116973  123610  131516  120653  118463  125907  124031  118930
   2048   49844   51315   51991   54151   53158   52519   53311   52965   53704   53196
   4096   37631   38040   40073   38878   39707   38800   40329   39514   40555   39780   40255
   8192   33015   34176   35269   34960   35064   35048   35366   35718   35622   35759   35900   35795
  16384   31763   32851   33470   33434   33251   33886   33734   58253   37714   72057   34277   38210   54380
  32768   35487   57635  125766   92239   99827   90024  140139  145908  145645  149407  147299  148796  148365
  65536  127508  140998  153479  149216  146867  144750  147058  149344  150468  150944  151472  147819  150513
 131072  119698  135814  150307  148915  145023  143271  145509  148743  150515  151543  152708  154004  154315
 262144   47873   50045   51672   51218   51160   50838   50844   51929   51459   51596   51809   52068   51998

Are there any Lustre tunable variables that we can fiddle with to get better performance, or even just more uniform performance rather than large dips depending on block size, etc.?
I noticed, whilst going back over the mailing list archives, that there was mention of a few variables:

echo 0 > /proc/sys/portals/debug

# mds and client
for osc in /proc/fs/lustre/osc/OSC*/; do
    # increase number of concurrent requests to each OST
    if [ -f $osc/max_rpcs_in_flight ]; then
        echo 8 > $osc/max_rpcs_in_flight
    fi

    # increase client-side writeback cache
    if [ -f $osc/max_dirty_mb ]; then
        echo 64 > $osc/max_dirty_mb
    fi
done

# ost
for ost in /proc/fs/lustre/obdfilter/ost*/; do
    # only read-cache files up to 32MB, not larger files
    if [ -f $ost/readcache_max_filesize ]; then
        echo $((32*1024*1024)) > $ost/readcache_max_filesize
    fi
done

I'm just wondering if there are any more we can tweak?

Cheers,
Cuong.
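As a companion to the script above, a loop like the following prints the current value of each tunable it touches, using the same /proc paths; entries that don't exist on a given node are simply skipped.

# Show the current settings of the tunables used in the script above.
for f in /proc/sys/portals/debug \
         /proc/fs/lustre/osc/OSC*/max_rpcs_in_flight \
         /proc/fs/lustre/osc/OSC*/max_dirty_mb \
         /proc/fs/lustre/obdfilter/ost*/readcache_max_filesize; do
    [ -f "$f" ] && echo "$f = $(cat "$f")"
done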