thr3ads.net - Lustre devel - [Lustre-devel] LustreFS performance (update) [Mar 2009]

Vitaly Fertman

2009-Mar-19 19:34 UTC

[Lustre-devel] LustreFS performance (update)

****************************************************
	LustreFS benchmarking methodology.
****************************************************

This document describes a benchmarking methodology that helps to understand
LustreFS performance, to reveal LustreFS bottlenecks in different
configurations on different hardware, and to ensure that the next LustreFS
release does not regress compared with the previous one. In other words:
	Goal1. Understand the HEAD performance.
	Goal2. Compare HEAD and b1_6 (b1_8) performance.

To achieve Goal1, the methodology suggests testing the software layers
bottom-up: the underlying back-end, the target server sitting on this
back-end, the network connected to this target, how the target performs
through this network, and so on up to the whole cluster. Each step makes
exactly 1 change over the previous one: either a new layer is added or 1
configuration parameter is changed (for example, another network type or
another back-end). Comparing the results of each test with the previous test
gives the overhead of the added layer or the performance impact of the
changed parameter.
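
The comparison between neighbouring steps can be sketched as follows. This
is only an illustration: the function name, the layer names and the
throughput numbers are hypothetical, not measurements.

```python
def layer_overheads(results):
    """Return (layer, absolute_loss, percent_loss) for each added layer.

    `results` is a bottom-up list of (layer_name, throughput) pairs, where
    each entry adds exactly one layer on top of the previous entry.
    """
    overheads = []
    for (_, below), (name, current) in zip(results, results[1:]):
        loss = below - current
        overheads.append((name, loss, 100.0 * loss / below))
    return overheads


# Hypothetical throughputs in MB/s, for illustration only.
measurements = [
    ("raw disk", 400.0),
    ("isolated OST", 360.0),
    ("remote OST", 324.0),
]

for name, loss, percent in layer_overheads(measurements):
    print(f"{name}: -{loss:.1f} MB/s ({percent:.1f}% below the previous layer)")
```

Because each step changes exactly one thing, the loss reported for a layer
can be attributed to that layer alone.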

To achieve Goal2, the methodology suggests going in the reverse, top-down
direction: test some large sub-systems first and, if a regression vs. a
previous LustreFS version is detected, perform more detailed tests. (This is
considered the primary goal of the 2.0 Performance Team.)

The document does not cover how to fix the problems revealed; probably some
special-purpose test needs to be run or oprofile needs to be compiled in.
That is out of the scope of this document.

Obviously, it is not possible to perform all the thousands of tests in all
the configurations, running all the special-purpose tests, etc. The document
tries to prepare:
1) all the essential and sufficient tests to see how the system performs in
general;
2) a minimal set of essential tests to see how the system scales in
different conditions.
Therefore, the plan does not guarantee that we will not miss a bottleneck or
a bug; it just tries to cover as many scenarios as possible in the most
interesting conditions and environment states.

The number of tests described below is already about 2K, and there will
definitely be more. It will take a lot of time to perform all of them and to
analyze the results. So one of the major concerns here is how to minimize
the number of tests without missing an interesting case, so that all the
results can be obtained within a reasonable amount of time. Please keep this
in mind while looking at the tests below.

**** Hardware Requirements. ****

The test plan implies that we change only 1 parameter (CPU, disk or network)
on each step. The HW requirements are:

-- at least 1 node with:
  CPU: 32;
  RAM: enough to have a ramdisk for MDS;
  DISK: enough disks for raid6 or raid1+0 (as this node could be mds or ost);
	  an extra disk for external journal;
  NET: both GiGe and IB installed.
-- at least 1 other node with:
  DISK: enough disks for raid6 or raid1+0 (as this node could be mds or ost);
	  an extra disk for external journal;
-- besides that: 8 clients, 3 other servers.
-- each of the other servers includes:
  DISK: raid6
  NET: IB installed.
-- each client includes:
  NET: both GiGe and IB installed.

**** Software requirements ****

1. Short term.
1.1. mdsrate
to be completed to test all the operations listed in MDST3 (see below).
1.2. mdsrate-**.sh
to be fixed/written to run mdsrate properly and test all the operations
listed in MDST3 (see below).
1.3. fake disk
implement a FAIL flag and report 'done' without doing anything in obdfilter
to get a low-latency disk.
1.4. MT.
add more tests here and implement them.

2. Long term.
2.1. mdtstack-survey
- an echo client-server is to be written for mds similar to ost.
- a test script similar to obdfilter-survey.sh is to be written.

**** Different configurations ****

Configuration of Node:
RAM. Amount of RAM on nodes (?)
CPU. Count of CPUs on nodes (1..32)
DISK. Disk type (raid, ramdisk, fake)
JOUR. Journal type (internal, external, ram)

Q:  which raid?
A: use RAID 1+0 for MDS; RAID6 for OST.

fake: to get a low-latency disk, it is preferable to report 'done' without
doing anything in obdfilter once some FAIL flag is set. This is useful for
OST testing because, first of all, it does not have the CPU overhead of the
memcpy incurred with a ramdisk, and it allows testing a large amount of
data, in contrast to a ramdisk. As a drawback, it skips the localfs code
paths.

Note: OSS back-end has write-through cache; MDS back-end has write-back
cache.

Configuration of Cluster:
CL. Number of clients (1,2,4,8)
MDS. Number of MDS nodes (1,2,4)
OSS. Number of OSS nodes (1,2,4)
NET. Network type (GiGe, IB)
OSTN. Number of OSTs per node (1,2,4)

Configuration of test:
TH. Number of threads per client (1,2,4,8)
VER. Lustre version (b1_6, HEAD; later b1_8)
FEAT. Lustre features to turn off (COS, SA, RA, debug messages)
TEST. Specific test parameters.

**** Testing ****
Low Layers Testing (LLT)
LLT1. Raw disk (lustre-iokit: sgpdd-survey)
LLT2. Local filesystem (lustre-iokit: ior-survey; is the fs mounted synchronously?)

Network Testing (NETT).
NETT1. lnet: lnet self-test.
NETT2. OBD: lustre-iokit: (obdfilter-survey, echo_client-osc-..-net-..-ost-echo_server)
NETT3. MD: (not ready)

OST Testing (OSTT).
OSTT1. Isolated OST (lustre-iokit: obdfilter-survey, echo_client-obdfilter-..-disk)
OSTT2. Remote OST (lustre-iokit: obdfilter-survey, echo_client-osc-..-ost-obdfilter-..-disk)
OSTT3. Client-OST IO (lustre-iokit: ior-survey, client-ost-disk).

MDS Testing (MDST).
MDST1. Isolated MDS test (not ready)
MDST2. Remote MDS test (not ready)
MDST3. Simple Client-MDS operation test

Mixed testing (MT) (not ready)

**** Statistics ****

During all the tests the following is supposed to be running on all the
servers:
1) HP collectl or LLNL's LMT;
2) something else?

*** Goal1. Understand the HEAD performance. ***

The Goal1 describes the testing methodology in the bottom-top direction,
from the lower layers (disk) to the complete Lustre cluster.

LLT1. Raw disk (lustre-iokit:sgpdd-survey)
RAM: fixed
CPU: 1
DISK: raid,ramdisk,fake (default=raid)
JOUR:-
CL: 1
OSS:1
NET: -
OSTN:-
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*)bulk size is specified as rszlo/rszhi=[1,4,64,1024K]
*)TH is specified as thrlo/thrhi=[1,2,4,8]
*)the amount of objects to work on in parallel: crglo=crghi=[1;TH]
i.e. test only cases when all the threads work on the same file and
when all of them work on a separate file.
TEST=[bulk;separate or common file]=8 tests;

Test matrix(TESTxTHxDISK):
Run TESTs with different amount of threads for each DISK.
TESTxTHxDISK=(8x4 - 1)x3=93 tests.
"-1" because TH=1 is already covered.

Total:93 tests.
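
As a rough sketch, the per-disk parameter sets for LLT1 could be generated
like this. The variable names rszlo/rszhi, thrlo/thrhi and crglo/crghi are
the ones listed above; the function name is hypothetical, and treating the
bulk size as a plain number of KB is an assumption. The sketch does not try
to drop the TH=1 overlap (where "common file" and "separate files" are the
same run), so it yields the full 8x4 combinations per DISK type.

```python
from itertools import product

BULK_SIZES_KB = [1, 4, 64, 1024]   # rszlo/rszhi values from the plan
THREADS = [1, 2, 4, 8]             # thrlo/thrhi values from the plan


def llt1_settings():
    """Yield one dict of sgpdd-survey variables per (bulk, TH, file mode)."""
    for bulk, th, separate in product(BULK_SIZES_KB, THREADS, (False, True)):
        # One object shared by all threads, or one object per thread.
        regions = th if separate else 1
        yield {
            "rszlo": bulk, "rszhi": bulk,
            "thrlo": th, "thrhi": th,
            "crglo": regions, "crghi": regions,
        }


settings = list(llt1_settings())
print(len(settings), "parameter sets per DISK type")
print(settings[0])
```

Each dict would be exported into the environment of one sgpdd-survey run,
and the whole list repeated for every DISK type.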

*** NETT1. lnet self-test. ***

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: -
JOUR:-
CL: 1,8 (default=1)
OSS:1
NET: GiGe, IB (default=IB)
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) test type: PING,READ,WRITE tests
*) bulk size for READ/WRITE: 1k,4k,64k,1M
[1 ping + 4 reads + 4 writes] = 9 tests

Test matrix (TESTxCLxTHxNETxCPU):
1. Multi-thread test.
Run TESTs on CL=1 with different numbers of threads.
TESTxTH=[1+4+4]x4=36 tests.
2. Multi-client test.
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=9x1x4=36 tests.
3. Network test.
As the nature of IB is different from GiGe, we need to repeat all the tests
from (1,2) here.
36+36=72 tests.
4. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel.
At the same time, if some HW (network) limit is reached, the result will not
be very demonstrative, so test with 1 small & 1 large bulk size only
[1K;1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=5x1x4x(3-1)=40.

Total: 184 tests.
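
The NETT1 total can be re-derived mechanically, which is a cheap check when
the matrix is edited later. This is a minimal sketch; the function and key
names are hypothetical, the counts follow the four steps above.

```python
BULK_SIZES = ["1k", "4k", "64k", "1M"]
THREADS = [1, 2, 4, 8]
CPU_COUNTS = [1, 8, 32]


def nett1_counts():
    """Count NETT1 runs per step of the test matrix."""
    # 1 ping + 4 reads + 4 writes = 9 tests.
    tests = ["ping"] + [f"{op} {size}" for op in ("read", "write")
                        for size in BULK_SIZES]
    multi_thread = len(tests) * len(THREADS)   # step 1, CL=1
    multi_client = len(tests) * len(THREADS)   # step 2, CL=8
    network = multi_thread + multi_client      # step 3, (1,2) on the other network
    # Step 4 keeps ping plus 1 small and 1 large bulk size: 5 tests,
    # run on the 2 CPU counts that differ from the default.
    cpu_tests = ["ping"] + [f"{op} {size}" for op in ("read", "write")
                            for size in ("1k", "1M")]
    cpu = len(cpu_tests) * len(THREADS) * (len(CPU_COUNTS) - 1)
    counts = {"multi_thread": multi_thread, "multi_client": multi_client,
              "network": network, "cpu": cpu}
    counts["total"] = sum(counts.values())
    return counts


print(nett1_counts())
```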

*** NETT2. OBD performance ***
lustre-iokit: obdfilter-survey, case=network.
	
The results of these tests are to be compared with the lnet results to get
the osc+ost+ptlrpc overhead.

RAM: fixed
CPU: 1
DISK: -
JOUR:-
CL: 1,8 (default=1)
OSS:1
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
i.e. test only cases when all the threads work on the same file and
when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix (TESTxTHxCL):
1. Multi-thread test.
Run TESTs on CL=1 with different numbers of threads. TESTxTH=8x4=32 tests.
2. Multi-client test.
Note: to be more demonstrative, the maximum number of threads should be
taken <8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=8x1x4=32 tests.
3. Network test.
Having the IB results in hand after (1,2) and the results from NETT1, we
already see how osc+ost+ptlrpc changes the behavior. There seems to be no
reason to repeat them for GiGe.
4. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel.
At the same time, if some HW (network) limit is reached, the result will not
be very demonstrative, so test with 1 small & 1 large bulk size only
[1K;1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(4-1)=48.

Total: 112 tests.

*** OSTT1. Isolated OST ***
lustre-iokit: obdfilter-survey, case=disk

The results of these tests are to be compared with the LLT results to get
the OST stack overhead.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: raid, fake (default=fake)
JOUR: int, ext, ram, (default=ext)
CL: 1
OSS:1
NET: -
OSTN:1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024K)
*) TH is specified through: thrlo=1, thrhi=8 (1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
i.e. test only cases when all the threads work on the same file and
when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix (TESTxTHxOSTNxDISKxJOURxCPU):
1. Multi-thread test.
Run TESTs on OSTN=1 with different numbers of threads. TESTxTH=8x4=32 tests.
2. Multi-OST test.
2.1. Let's check how OSTs vs. threads per OST scale (TH=OSTN).
2.2. Let's check how the system scales with many OSTs and threads
(TH=8*OSTN).
[OSTN>1;TH=OSTN,8*OSTN]. TESTxOSTNxTH=8x2x2=32 tests.
3. DISK test.
As the other disk types are completely different, let's repeat most of
(1,2) for them:
[TH=OSTN;8*OSTN]: TESTxOSTNxTHxDISK=8x3x2x1=48
4. JOURNAL test.
Limit the tests to raid-disk only.
Limit the tests to 1 small and 1 large bulk only: [1K,1024K].
TESTxOSTNxTHxJOUR: 4x3x2x2=48
5. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs. It is better to perform it on a
fast back-end (DISK=fake) to see how much the CPU really matters.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel.
Also, run with a small & a large bulk only: [1K,1024K].
[OSTN=4,TH=1,2,4,8]: TESTxOSTNxTHxCPU=4x1x4x2=32

Total: 192 tests.

*** OSTT2. Real OST test ***
lustre-iokit: obdfilter-survey, case=netdisk

This test is a composition of OBD performance and Isolated OST tests,
so its results are to be compared with NETT2 & OSTT1 results.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: fake
JOUR: ext
CL: 1,8 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
i.e. test only cases when all the threads work on the same file and
when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix (TESTxTHxCLxCPUxOSSxOSTN):
1. Multi-thread test.
Run TESTs on CL=1 with different numbers of threads. TESTxTH=8x4=32 tests.
2. Multi-client test.
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=8x1x4=32 tests.
3. Network test.
Having the IB results in hand after (1,2) and the results from NETT2, we
already see how osc+ost+ptlrpc+obdfilter changes the behavior. Thus, there
seems to be no reason to repeat them for GiGe.
4. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel.
At the same time, if some HW (network) limit is reached, the result will not
be very demonstrative, so test with 1 small & 1 large bulk size only
[1K;1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(3-1)=32.
5. OSTN test.
The same OSC, network, CPU and disk; just check how scalable the OST stack
(see tests 1,2) is.
5.1. Let's check how N threads per 1 OST vs. 1 thread per N OSTs scales
(CL=OSTN).
5.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
As the difference from (1,2) is on the OSS part only, it is enough to test
in separate directories only. It seems enough to look at 1 small & 1 large
bulk only: [1K,1024K].
[CL=OSTN,8;TH=1,8]. TEST=2. TESTxCLxTHxOSTN=2x2x2x2=16 tests.
6. OSS test.
6.1. Let's check how 1 thread per N OSTs vs. 1 thread per N OSSes scales
(CL=OSS).
6.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
As the difference from (1,2) is on the OSS part only, it is enough to test
in separate directories only. It seems enough to look at 1 small & 1 large
bulk only: [1K,1024K].
[CL=OSS,8;TH=1,8]. TEST=2. TESTxCLxTHxOSS=2x2x2x2=16 tests.

Total:128 tests

*** OSTT3. Client-OST test ***
lustre-iokit: ior-survey.

The test results are to be compared with the OSTT2 results to get the
overhead of the Lustre client: client stack, distributed locking, etc.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: fake
JOUR: ext
CL: 1,8 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) CL is specified through $clients_hi
*) TH is specified through $tasks_per_client_hi
*) bulk is specified through rsize_lo/hi (1,4,64,1024K)
*) file_per_task=[0;1]
i.e. test only cases when all the threads work on the same file and
when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix(TESTxTHxCLxCPUxOSSxOSTN): absolutely the same as for OSTT2.

NETT3. MD: (not ready)
MDST1. Isolated MDS (not ready)
MDST2. Remote MDS (not ready)
This set of tests needs to be implemented in a utility similar to
obdfilter-survey but for MDS testing.

MDST3. Simple Client-MDS operation tests

1. create,mknod,mkdir (symlink, link??)
RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR(MDS): int,ext,ram (default=ext)
CL: 1,8 (default=1)
MDS:1,2,4 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be mdsrate/mdsrate-create-small.sh, but it needs to
be fixed to support all of these operations, not only create. If so:
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8]
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test the case when all the threads work in the
same dir and the case when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
[common or separate dir]=2 tests;

Note: we should probably limit the number of files in 1 directory to 2M,
otherwise the performance will definitely degrade.
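
The DIRNUM and nfiles rules can be sketched as below. The helper name is
hypothetical, and reading "2M" as 2,000,000 files is an assumption; the
--dirnum option itself is still only proposed above and does not exist yet.

```python
MAX_FILES_PER_DIR = 2_000_000  # assumed reading of the "2M" limit


def mdsrate_layout(clients, threads_per_client, files_per_dir, separate_dirs):
    """Return DIRNUM and nfiles for one run of the proposed mdsrate test."""
    if files_per_dir > MAX_FILES_PER_DIR:
        raise ValueError("too many files in one directory")
    # Either every thread gets its own directory, or all share one.
    dirnum = clients * threads_per_client if separate_dirs else 1
    return {"dirnum": dirnum, "nfiles": files_per_dir * dirnum}


# Hypothetical sizes: 8 clients, 8 threads each, 10000 files per directory.
print(mdsrate_layout(8, 8, 10_000, separate_dirs=True))
print(mdsrate_layout(8, 8, 10_000, separate_dirs=False))
```

With TH=1 and CL=1 both modes give DIRNUM=1, which is why that combination
is counted only once in the matrix.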

Test matrix (TESTxTHxCLxCPUxMDSxOSSxDISKxJOUR):
1. Multi-thread test. (mknod)
Run TESTs on CL=1 with different numbers of threads. TESTxTH=2x4-1=7 tests
(not 8, because with TH=1 DIRNUM is always 1, and this is already covered).
2. Multi-client test. (mknod)
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=2x1x4=8 tests.
3. OSS test. (create)
3.1. Let's check how a multi-client system scales (TH=OSS).
3.2. Let's check how a system under large load scales (TH=8).
As the difference from (1,2) is on the OSS part only, it is enough to test
in separate directories only. Stripeness is [1, -1]. TEST=2.
TESTxCLxTHxOSS=[2x2x2]x2 + [2x2x2]x1 (1 OSS case)=24
4. Network test.
Having the IB results in hand after (1,2,3) and the results from NETT1, we
already see how mdc+mdt-stack+ptlrpc changes the behavior. There seems to be
no reason to repeat them for GiGe.
5. DISK test. (mknod)
Unlike the OST testing, we do not have an echo-md client (MDST1), so we have
not yet checked how different disks impact the performance; we need to check
it here. As different disks are of a completely different nature, we need to
repeat most of (1,2) here.
[TH=1,8]: TESTxCLxTHxDISK=(2x2x2-1)x1=7
6. JOURNAL test. (mknod)
Repeat (5) for different journals, but limit the test to raid-disk only.
TESTxCLxTHxDISKxJOUR=(2x2x2-1)x1x2=14
7. CPU test. (mknod)
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel, so run it for CL=8 only.
TESTxCLxTHxCPU=2x1x4x(3-1)=16
8. CMD test. (mkdir)
8.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDSes scales
(CL=MDS).
8.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
These tests happen in a separate directory (for each thread) only, and it is
enough to test with the nid creation policy only. TEST=1.
TESTxCLxTHxMDS=1x2x4x2=16 tests.

Total: 16 tests for mkdir, 52 for mknod, 24 for create.

2. stat

RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST: mdsrate/mdsrate-stat-small.sh
*) add THREADS_PER_CLIENT to the script to specify TH
*) CL is specified through CLIENTS  or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the
same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add READDIR_ORDER to test readdir access order (random order is not very
interesting for stat).
[common or separate dir; readdir order]=2 tests.

Test matrix (TESTxTHxCLxCPUxMDSxOSS):
1. Multi-thread test.
Run TESTs on CL=1 with different numbers of threads. TESTxTH=2x4-1=7 tests
(not 8, because with TH=1 DIRNUM is always 1, and this is already covered).
2. Multi-client test.
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=2x1x4=8 tests.
3. OSS test.
3.1. Let's check how a multi-client system scales (TH=OSS).
3.2. Let's check how a system under large load scales (TH=8).
As the difference from (1,2) is on the OSS part only, it is enough to test
in separate directories only. The test must be done for files created with
different stripeness: [1, -1]. TEST=2.
TESTxCLxTHxOSS=[2x2x2]x2=16
4. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel, so run it for CL=8 only.
TESTxCLxTHxCPU=2x1x4x(3-1)=16
5. CMD test. (mkdir)
5.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDSes scales
(CL=MDS).
5.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
1 creation policy (nid) is enough: TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 79 tests.

3. unlink (mdsrate-create-small.sh)

RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR(MDS): int,ext,ram (default=ext)
CL: 1,8 (default=1)
MDS:1,2,4 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be mdsrate/mdsrate-create-small.sh, but it needs to
be fixed to support all of these operations, not only create. If so:
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8]
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test the case when all the threads work in the
same dir and the case when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add the ability to remove in readdir order to the mdsrate test and its
script.
[readdir or _create_ order; common or separate dir]=3 tests
(skip readdir/common dir).

Note: we should probably limit the number of files in 1 directory to 2M,
otherwise the performance will definitely degrade.

Test matrix (TESTxTHxCLxCPUxMDSxOSSxDISKxJOUR):
1. Multi-thread test. (mknod)
Run TESTs on CL=1 with different numbers of threads. TESTxTH=3x4-2=10 tests.
2. Multi-client test. (mknod)
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=3x1x4=12 tests.
3. OSS test. (create)
3.1. Let's check how a multi-client system scales (TH=OSS).
3.2. Let's check how a system under large load scales (TH=8).
As the difference from (1,2) is on the OSS part only, it is enough to test
in separate directories only. Stripeness is [1, -1]. TEST=4.
TESTxCLxTHxOSS=[4x2x2]x2 + [4x2x2]x1 (1 OSS case)=48
4. Network test.
Having the IB results in hand after (1,2,3) and the results from NETT1, we
already see how mdc+mdt-stack+ptlrpc changes the behavior. There seems to be
no reason to repeat them for GiGe.
5. DISK test. (mknod)
Unlike the OST testing, we do not have an echo-md client (MDST1), so we have
not yet checked how different disks impact the performance; we need to check
it here. As different disks are of a completely different nature, we need to
repeat most of (1,2) here.
[TH=1,8]: TESTxCLxTHxDISK=(3x2x2-2)x1=10
6. JOURNAL test. (mknod)
Repeat (5) for different journals, but limit the test to raid-disk only.
TESTxCLxTHxDISKxJOUR=(3x2x2-2)x2=20
7. CPU test. (mknod)
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel, so run it for CL=8 only.
TESTxCLxTHxCPU=3x1x4x(3-1)=24
8. CMD test. (mkdir)
8.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDSes scales
(CL=MDS).
8.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
These tests happen in a separate directory (for each thread) only, and it is
enough to test with the nid creation policy only. TEST=2.
TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 32 tests for mkdir, 76 for mknod, 48 for create.

4. find (not ready)

**** MT. Mixed testing. ****

MT1. Create-write test.
RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS:1
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop, writes 1 bulk
to each and closes it.
*) it is enough to test with a small bulk only: [1k]
*) [common or separate dir]=2 tests;

Test matrix (TESTxTHxCLxCPUxMDSxDISK):
1. Multi-thread test.
Run TESTs on CL=1 with different numbers of threads. TESTxTH=2x4-1=7 tests
(not 8, because with TH=1 it is always in 1 dir, and this is already
covered).
2. Multi-client test.
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
TESTxCLxTH=2x1x4=8 tests.
3. DISK test.
Check how different disks impact the performance.
As different disks are of a completely different nature, we need to repeat
most of (1,2) here.
[TH=1,8]: TESTxCLxTHxDISK=(2x2x2-1)x1=7
4. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel, so run it for CL=max
only: [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=2x1x4x(4-1)=24
5. CMD test.
5.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDSes scales
(CL=MDS).
5.2. Let's check how the system scales with many clients and threads (CL=8).
Note: to be more demonstrative, the maximum number of threads could be taken
<8 if TH=8 already reaches the maximum network throughput with a small
number of clients.
These tests happen in a separate directory (for each thread) only, creation
policy=[nid,name].
TEST=2. TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 78 tests.

MT2. Create-Readdir test.
RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1) (1 extra client does "ls -U")
MDS:1,2,4 (default=1)
OSS:1
NET: IB
OSTN:1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop and immediately
closes them. 1 thread on another client does "ls -U". It is done in 1
directory.

The test matrix is exactly the same as for MT1. Total: 78 tests.

MT3. untar a kernel.
MT4. pmake (compile a kernel).
RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1
MDS:1,2,4 (default=1)
OSS:1,2,4 (default=1)
NET: IB
OSTN:1
TH: 1
F: debug
TEST: a new one.

Test matrix (TESTxCPUxMDSxOSSxDISK):
1. DISK test.
Check how different disks impact the performance. TESTxDISK=1
2. CPU test.
Note: lnet fixes from Liang to be applied here.
Run TESTs on different numbers of CPUs.
It is mostly interesting to look at large numbers of threads, as we are
going to benefit from handling them in parallel, so run it for CL=max only:
TESTxCPU=1x(3-1)=2
3. CMD test.
Creation policy=name. TESTxMDS=1x2=2 tests.
4. OSS test.
As most of the files are small, stripeness does not play any role (=1).
TESTxOSS=1x(3-1)=2.

Total: 7 tests.

MT5. ??? Some more tests ????

**** Goal2. Compare HEAD and b1_6 (b1_8) performance. ****

This section describes the testing methodology in the reverse order of
testing, i.e. in the top-down direction, making sure the new LustreFS (HEAD)
version does not regress compared with the previous ones (b1_6/b1_8).

Therefore, the first testing cycle includes:
1) MT, MDST3, OSTT3, NETT1 from the above tests.
2) no CMD tests

If a regression is detected, lower-layer tests are to be run until the
regression disappears.
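
One possible way to decide which results of the first cycle call for the
lower-layer tests is sketched below. The function name, the 5% tolerance
and all the rates are hypothetical; the plan itself does not define a
threshold.

```python
def find_slowdowns(head, baseline, tolerance=0.05):
    """Return {test: relative_drop} where HEAD is slower than the baseline.

    `head` and `baseline` map a test name to a rate (higher is better).
    A test is reported only if HEAD is more than `tolerance` below the
    baseline; tests missing from `head` are skipped.
    """
    slower = {}
    for test, base_rate in baseline.items():
        head_rate = head.get(test)
        if head_rate is not None and head_rate < base_rate * (1.0 - tolerance):
            slower[test] = (base_rate - head_rate) / base_rate
    return slower


# Hypothetical rates, for illustration only.
b1_6 = {"MDST3 create": 5000.0, "OSTT3 write": 800.0, "NETT1 ping": 20000.0}
head = {"MDST3 create": 4000.0, "OSTT3 write": 790.0, "NETT1 ping": 20500.0}

for test, drop in find_slowdowns(head, b1_6).items():
    print(f"{test}: HEAD is {drop:.0%} slower, run the lower-layer tests")
```

A tolerance is needed because repeated runs of the same test rarely give
identical numbers; its value should come from the observed run-to-run
variation.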

**** Goal3. CMD testing. ****

MT, MDST3 tests, their CMD sections.

**** Goal4. Quick weekly MD performance test. ****

1) It covers tests described in MT,MDST sections.
2) MDST: No CPU,OSS,OSTN tests
3) MT: no MT1,MT2 tests
4) Only 1 node configuration:
	MDS on RAID1+0 with write back cache
	OSS on RAID6 with write through cache
	JOUR: external for both servers;
5) Only 1 network:
	IB;
6) Minimal amount of cluster configurations:
	MDS=1; OST=1; [CL,TH]=[1,1],[1,8],[8,8];

MDST1.1: perform only create (not mkdir,mknod) for [common or separate dir]=2.
1. Multi-thread test. TESTxTH=2x2-1=3 tests
2. Multi-client test. TESTxCLxTH=2x1x1=2 tests.
Total: 5 tests.

MDST1.2. stat for [common or separate dir; readdir order]=2 tests.
1. Multi-thread test. TESTxTH=2x2-1=3 tests
2. Multi-client test. TESTxCLxTH=2x1x1=2 tests.
Total: 5 tests.

MDST1.3 unlink for [readdir or _create_ order; common or separate dir]=3
(skip readdir/common dir). All tests are done against create (not
mkdir,mknod).
1. Multi-thread test. TESTxTH=3x2-2=4 tests
2. Multi-client test (mknod) TESTxCLxTH=3x1x1=3 tests.
Total: 7 tests.

MT3. untar a kernel.
MT4. pmake (compile a kernel).
Total: 1 tests.

Total: 19 tests.

--
Vitaly

Andrew C. Uselton

2009-Mar-19 20:16 UTC

[Lustre-devel] LustreFS performance (update)

Howdy Vitaly,
   I like this.  It is quite comprehensive and detailed.  I'd like to
offer a few constructive criticisms in hope that you will better achieve
your goals.  Mostly I'll stick them in-line where they seem relevant,
but I'll start with:
1)  Your write up is quite dense and terse.  I could follow the overall 
structure, but found it pretty tough going to understand any specific 
detail.  It really helps to work with someone who will write up the same 
information, but in a form with whole sentences and a minimum of 
acronyms or special symbols.  Define the acronyms you do use in a clear 
way in one place that I can refer back to.


Vitaly Fertman wrote:
> ****************************************************
> 	LustreFS benchmarking methodology.
> ****************************************************
> 
> The document aims to describe the benchmarking methodology which helps
> to understand the LustreFS performance and reveal LustreFS bottlenecks  
> in
> different configurations on different hardware, to ensure the next  
> LustreFS
> release does not downgrade comparing with a previous one. In other  
> words:
> 	Goal1. Understand the HEAD performance.
> 	Goal2. Compare HEAD and b1_6 (b1_8) performance.
> 
> To achieve the Goal1, the methodology suggests to test different  
> layers of
> software in the bottom-top direction, i.e. the underlying back-end,  
> the target
> server sitting on this back-end, the network connected to this target  
> and how
> the target performs through this network, etc up to the whole cluster.
I like this approach.  My own efforts tend to be at-scale testing at the 
whole-cluster end of the range, often in the presence of other cluster 
activity.  It is good to have the details of the underlying components 
documented.

...
> Obviously, it is not possible to perform all the thousands of tests in
> all the configurations,
> running all the special purpose tests, etc, the document tries to  
> prepare:
> 1) all the essential and sufficient tests to see how the system  
> performs in general;
> 2) some minimal amount of essential tests to see how the system scales  
> in different
> conditions.
In some cases it's obvious, but in many it is not clear what exactly you
mean to be testing.  It is a good extension to your methodology to state 
clearly not only the mechanics of the test itself, but what you think 
you are testing with the given experiment.  Spend a little time and 
describe what system is under examination, how it responds or should 
respond to the proposed test, and what tunables and parameters you think 
might be relevant.  For instance, if the test is supposed to saturate 
the target server, then how much I/O do you expect will be required and 
why?  What timeout or other tunable may determine the observed 
saturation point?  Your goal should be to have not only a test, but a 
real expectation about its results even before you run the test.  Once 
you have that expectation then you can evaluate the results.  The 
bottom-up approach helps with this, since you can use the performance of 
the individual pieces to help establish your expectation about the larger 
assemblies.
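
Forming the expectation before the run can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layer names, the throughput figures, and the 20% tolerance are hypothetical placeholders, not measurements from any real system. The idea is that the slowest component measured in isolation bounds what the assembled stack can deliver, which gives a number to compare against once the larger test has run.

```python
def expected_throughput(layer_rates):
    """Upper bound for a stack: the slowest layer measured in isolation."""
    name = min(layer_rates, key=layer_rates.get)
    return name, layer_rates[name]


def evaluate(measured, expected, tolerance=0.2):
    """Classify a result against the expectation formed before the run."""
    if measured > expected:
        return "above expectation: re-check the component measurements"
    if measured < expected * (1.0 - tolerance):
        return "below expectation: look for overhead in the added layer"
    return "as expected"


# Hypothetical per-layer rates in MB/s (placeholders, not real data).
layers = {"disk": 400.0, "obdfilter": 380.0, "network": 900.0}
bottleneck, bound = expected_throughput(layers)
print(bottleneck, bound)
print(evaluate(350.0, bound))
```

A result above the bound is treated as suspicious rather than good news, since it usually means one of the component measurements was wrong.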

...
> **** Hardware Requirements. ****
> 
> The test plan implies that we change only 1 parameter (cpu or disk or network)
> on each step. The HW requirements are:
> 
> -- at least 1 node with:
>   CPU:32;
>   RAM: enough to have a ramdisk for MDS;
>   DISK: enough disks for raid6 or raid1+0 (as this node could be mds or ost);
> 	  an extra disk for external journal;
>   NET: both GiGe and IB installed.
> -- at least 1 another node includes:
>   DISK: enough disks for raid6 or raid1+0 (as this node could be mds or ost);
> 	  an extra disk for external journal;
> -- besides that: 8 clients, 3 other servers.
> -- the other servers include:
>   DISK: raid6
>   NET: IB installed.
> -- client includes:
>   NET: both GiGe and IB installed.
> 
> **** Software requirements ****

You might provide links to these tests for those not familiar with them.

> 1. Short term.
> 1.1 mdsrate
> to be completed to test all the operations listed in MDST3 (see below).
> 1.2 mdsrate-**.sh
> to be fixed/written to run mdsrate properly and test all the operations listed in
> MDST3 (see below).
> 1.3. fake disk
> implement FAIL flag and report 'done' without doing anything in obdfilter to get
> a low-latency disk.
> 1.4. MT.
> add more tests here and implement them.
> 
> 2. Long term.
> 2.1. mdtstack-survey
> - an echo client-server is to be written for mds similar to ost.
> - a test script similar to obdfilter-survey.sh is to be written.
> 
> **** Different configurations ****
> ...

I'll cut it short here, but in general, I think you might be surprised: 
if you organize this document so that anyone else could come along 
behind you and perform all the same tests in the same way, you might get 
a lot of others doing these experiments alongside you.  That would make 
your job a lot easier and increase the likelihood that bugs and 
regressions would be caught quickly.

> --
> Vitaly
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
Cheers,
Andrew

parinay kondekar

2009-Mar-20 05:47 UTC

[Lustre-devel] LustreFS performance (update)

The wiki :: https://wikis.clusterfs.com/intra/index.php/LustreFS_performance

~p

Vitaly Fertman wrote:
> ****************************************************
> 	LustreFS benchmarking methodology.
> ****************************************************
>
>

Vitaly Fertman

2009-Mar-20 13:15 UTC

[Lustre-devel] LustreFS performance (update)

Hi Andrew,

thanks for your feedback,
indeed, this still looks more like a raw test list than a ready-for-publishing
document, but this is continuous work and I am still working on it, so I will
try to address your suggestions.

On Mar 19, 2009, at 11:16 PM, Andrew C. Uselton wrote:
> Howdy Vitaly,
>  I like this.  It is quite comprehensive and detailed.  I'd like to
> offer a few constructive criticisms in hope that you will better
> achieve your goals.  Mostly I'll stick them in-line where they seem
> relevant, but I'll start with:
> 1)  Your write up is quite dense and terse.  I could follow the
> overall structure, but found it pretty tough going to understand any
> specific detail.  It really helps to work with someone who will
> write up the same information, but in a form with whole sentences
> and a minimum of acronyms or special symbols.  Define the acronyms
> you do use in a clear way in one place that I can refer back to.
>
--
Vitaly

Eric Barton

2009-Mar-24 00:34 UTC

[Lustre-devel] LustreFS performance (update)

Vitaly,

I've been following this thread with great interest and I'd like
to chat with you about this and also the MDS performance regression
tests.  Unfortunately, I'm unlikely to be able to do that this week
and it will probably have to wait until I'm back in the UK next
week.  In the meantime...

1. Have you got a rough idea how much work it would be to write the
   software that could exercise the MDD directly?  I'd just like to
   know if we're talking days or weeks or months - we need to know
   that before we decide whether to do it.

2. I think Andrew Uselton's comments are helpful.  We cannot afford
   routinely to sample the whole performance space - there are just
   too many dimensions.  So we need to develop a performance model
   that allows us to restrict the number of measurements we need
   to be confident that there are no surprises "in between" the
   points we have sampled.

   That means we have to start running tests as soon as possible over
   as wide a parameter range as possible, with as much hardware as
   possible.  Then we'll start to get a feel for how much variability
   there is all over the space and where the "edges" and asymptotes
   are.

3. It's worthwhile taking time to analyse and present results with care.
   I've attached a spreadsheet that compares ping performance of a single
   8-core server with varying numbers of clients and client threads,
   measured using different LNET locking schemes - hp (HEAD ping), 2lp
   (HEAD modified to split the LNET global lock into 2) and 3lp (same, but
   splitting the LNET global lock into 3).

   The lower row of graphs shows ping throughput versus number of client
   nodes, with different numbers of threads per node in each series.  The
   upper row of graphs shows the same ping throughput, but plotted
   against client threads totalled over all nodes, with different numbers
   of nodes in each series.  Please note....

   a) Set axis scaling correctly so that visual comparison is accurate.

   b) The upper row of graphs shows that it's the total number of threads
      exercising the server that's most important - and that how those
      threads are distributed over client nodes seems to matter most when
      there are 8 of them.  That's absolutely _not_ obvious from looking
      at the lower row of graphs.
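
The two ideas above (plotting against total client threads, and checking for surprises "in between" sampled points) can be sketched in Python. This is a rough illustration, not the analysis behind the attached spreadsheet: the function names, the 10% tolerance, and every number in `results` are hypothetical placeholders. The first function regroups measurements so that each series is one node count plotted against nodes x threads per node; the second flags sampled points that linear interpolation between their neighbours predicts poorly, which is one simple signal that an "edge" lies nearby and more samples are needed there.

```python
from collections import defaultdict


def by_total_threads(results):
    """Regroup {(nodes, threads_per_node): rate} into
    {nodes: [(total_threads, rate), ...]}, one series per node count."""
    series = defaultdict(list)
    for (nodes, threads_per_node), rate in results.items():
        series[nodes].append((nodes * threads_per_node, rate))
    return {n: sorted(pts) for n, pts in sorted(series.items())}


def interpolation_gaps(points, rel_tol=0.1):
    """Return x values whose measured y differs from the linear
    interpolation of their two neighbours by more than rel_tol."""
    pts = sorted(points)
    flagged = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(pts, pts[1:], pts[2:]):
        predicted = y0 + (y2 - y0) * (x1 - x0) / (x2 - x0)
        if abs(y1 - predicted) > rel_tol * abs(y1):
            flagged.append(x1)
    return flagged


# Hypothetical ping rates keyed by (client nodes, threads per node).
# These numbers are placeholders for illustration, not measurements.
results = {
    (1, 1): 100.0, (1, 2): 200.0, (1, 4): 400.0, (1, 8): 420.0, (1, 16): 425.0,
    (2, 2): 390.0, (2, 4): 415.0,
}
series = by_total_threads(results)
for nodes, points in series.items():
    print(nodes, points, interpolation_gaps(points))
```

A series with fewer than three points yields no flags, so sparse series need more samples before this check says anything about them.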

    Cheers,
              Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of parinay kondekar
> Sent: 19 March 2009 10:47 PM
> To: Vitaly Fertman
> Cc: lustre-2.0-performance at sun.com; minh diep; Lustre Development Mailing List
> Subject: Re: [Lustre-devel] LustreFS performance (update)
> 
> The wiki :: https://wikis.clusterfs.com/intra/index.php/LustreFS_performance
> 
> ~p
> 
> Vitaly Fertman wrote:
> > ****************************************************
> > 	LustreFS benchmarking methodology.
> > ****************************************************
> >
> >
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: example graphs.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 51672 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-devel/attachments/20090323/d4bac167/attachment-0001.bin
