Hi,

during my Lustre tests (vers. 1.4.1, kernel 2.6.12.5 patched with bugzilla
patches, 1 MDS, 2 OSTs, Gigabit network) I find extremely low performance for
I/O with small files, e.g. when extracting a Linux kernel source tree. This is
about 20 times slower than the performance on an OST's native filesystem. The
same applies to the file creation/deletion part of bonnie++, as shown in the
example below:

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ni-01-01         8G 37512  75 55839  49 37416  91 37760  98 56343  90  59.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   199   3   304  13    56   0   305   5   380  16    64   0

These figures are almost 100 times lower than on the OST's native filesystem,
which gives:

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sn-03-1          8G 37693  96 181285 66 76051  36 30826  90 194777 61 313.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 25232 100 +++++ +++ 21106  97 24226  99 +++++ +++ 20339 100

Do I have to live with this, or is there a way to improve it?

Thanks,

Roland
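For anyone reproducing these numbers: the output above is consistent with an
invocation roughly like the following sketch (the mount point and user are
placeholders; -s is the test size in MB, and -n 16 means 16*1024 files in the
create/delete phase):

# hypothetical invocation matching the report format above
bonnie++ -d /mnt/lustre -s 8192 -n 16 -u nobody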
>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:>> Also, when running bonnie++, my read performance is quite low >> compared to write, even though the OSTs have equal read/write >> throughput. Do you have an idea, where this could come from? Andreas> Traditionally Lustre has been better at write than read. Ok. >> Concerning performance: When running bonnie++ on a single >> client with no I/O on any other client, I obtain only 100MB/s >> block write throughput, and 60MB/s block read throughput, even >> though each OST gives 200MB/s on the raw device. I''m using >> Infiniband TCP/IP as interconnect which gives me 300MB/s >> max. throughput. Andreas> You may consider increasing the Andreas> /proc/fs/lustre/osc/*/max_rpcs_in_flight parameter for Andreas> your clients. What value would be appropriate? Andreas> What stripe count are you using? stripe count is 0. Andreas> If you need high single-client performance a stripe Andreas> count of 4 or will likely saturate your network. Will this decrease parallel throughput? Andreas> Do you get better aggregate performance when multiple Andreas> clients are writing? One of the strengths of Lustre is Andreas> that often the aggregate performance will increase as Andreas> more clients are added. Yes I do. Andreas> We have also made several performance improvements for Andreas> newer lustre releases, and this is an ongoing process. Andreas> For some specific workloads there are tunings that will Andreas> improve things noticably, but aren''t suitable for Andreas> e.g. 1000-client HPC clusters so can''t go in by default. Well, this cluster has 170 nodes. Thanks, Roland
Thanks for all the tips. I will share when I get it working.

Steve

-----Original Message-----
From: lustre-discuss-admin@lists.clusterfs.com
[mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of Roland Fehrenbacher
Sent: Wednesday, August 31, 2005 9:48 AM
To: Andreas Dilger
Cc: lustre-discuss@clusterfs.com
Subject: Re: [Lustre-discuss] I/O performance on small files

[...]
On Aug 31, 2005 16:48 +0200, Roland Fehrenbacher wrote:
> During my stress tests I have occasional error messages on the OSS
> nodes like:
> [693471.980356] LustreError: 2542:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1125497780, 5s ago) req@ffff81007c75f800 x485514/t0 o401->@NET_0xac1103fd_UUID:15 lens 4168/64 ref 1 fl Rpc:/0/0 rc 0/0
> [693471.985913] LustreError: 4337:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1125497780, 5s ago) req@ffff81002f7fa400 x485515/t0 o401->@NET_0xac1103fd_UUID:15 lens 4168/64 ref 1 fl Rpc:/0/0 rc 0/0
> [693471.985960] LustreError: 4337:0:(recov_thread.c:396:log_commit_thread()) commit ffff810030dce000:ffff81006a657e00 drop 128 cookies: rc -110
> [693472.038458] LustreError: 2542:0:(recov_thread.c:396:log_commit_thread()) commit ffff810003322000:ffff81006a657e00 drop 128 cookies: rc -110
> [693472.400685] LustreError: 26226:0:(lib-move.c:162:lib_match_md()) 2886796053: Dropping PUT from 2886796285.12345 portal 16 match 0x7688a offset 0 length 64: no match

I believe this was resolved in a newer version of Lustre. These particular
timeouts are not serious.

> Also, when running bonnie++, my read performance is quite low compared
> to write, even though the OSTs have equal read/write throughput. Do
> you have an idea where this could come from?

Traditionally Lustre has been better at write than read.

> Concerning performance: When running bonnie++ on a single client with
> no I/O on any other client, I obtain only 100MB/s block write
> throughput, and 60MB/s block read throughput, even though each OST gives
> 200MB/s on the raw device. I'm using Infiniband TCP/IP as interconnect,
> which gives me 300MB/s max. throughput.

You may consider increasing the /proc/fs/lustre/osc/*/max_rpcs_in_flight
parameter for your clients.

What stripe count are you using? If you need high single-client performance,
a stripe count of 4 or so will likely saturate your network.

Do you get better aggregate performance when multiple clients are writing?
One of the strengths of Lustre is that often the aggregate performance will
increase as more clients are added.

We have also made several performance improvements for newer Lustre
releases, and this is an ongoing process. For some specific workloads there
are tunings that will improve things noticeably, but they aren't suitable
for e.g. 1000-client HPC clusters, so they can't go in by default.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
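A minimal sketch of applying the max_rpcs_in_flight suggestion on a client,
assuming the /proc path named above (32 is only an illustrative value, not a
recommendation from this thread):

# check the current values, then raise them on this client
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
        echo 32 > $f
done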
On Aug 24, 2005 12:54 +0200, Roland Fehrenbacher wrote:
> during my Lustre tests (vers. 1.4.1, kernel 2.6.12.5 patched with
> bugzilla patches, 1 MDS, 2 OSTs, Gigabit network) I find extremely low
> performance for I/O with small files like in extracting a Linux kernel
> source. This is about 20 times slower than the performance on an
> OST's native filesystem. The same applies to the file
> creation/deletion part of bonnie++ as shown in the below example.
>
> Do I have to live with this, or is there a way to improve.

In general, Lustre performs best for large files and concurrent operation
of many clients. While the metadata and small file performance of a single
client is not outstanding, it can scale efficiently to thousands of clients
doing concurrent operations.

For nodes which are expected to have a lot of interactive use (e.g. login
nodes) it is possible to increase the DLM LRU size for these nodes to
reduce interactive latency. This can be done on a smallish number of nodes
(10-20) without problems, but isn't optimal for all clients in very large
clusters.

for LRU in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
        case $LRU in
        */MDC*) echo 2000 > $LRU ;;
        */OSC*) echo 1000 > $LRU ;;
        esac
done

This tuning has shown dramatic improvements for the performance of tasks
like untar/compile of a kernel which touch a lot of small files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:Andreas> On Aug 24, 2005 12:54 +0200, Roland Fehrenbacher wrote: >> during my Lustre tests (vers. 1.4.1, kernel 2.6.12.5 patched >> with bugzilla patches, 1 MDS, 2 OSTs, Gigabit network) I find >> extremely low performance for I/O with small files like in >> extracting a Linux kernel source. This is about 20 times slower >> than the performance on an OST''s native filesystem. The same >> applies to the file creation/deletion part of bonnie++ as shown >> in the below example. >> >> Do I have to live with this, or is there a way to improve. Andreas> In general, Lustre performs best for large files and Andreas> concurrent operation of many clients. While the metadata Andreas> and small file performance of a single client is not Andreas> outstanding, it can scale efficiently to thousands of Andreas> clients doing concurrent operations. Andreas> For nodes which are expected to have a lot of interactive Andreas> use (e.g. login nodes) it is possible to increase the DLM Andreas> LRU size for these nodes to reduce interactive latency. Andreas> This can be done on a smallish number of nodes (10-20) Andreas> without problems, but isn''t optimal for all clients in Andreas> very large clusters. Andreas> for LRU in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do Andreas> case LRU in Andreas> MDC*) echo 2000 > $LRU ;; Andreas> OSC*) echo 1000 > $LRU ;; Andreas> esac Andreas> done This helped improve things indeed. Thanks for the hint. Andreas> This tuning has shown dramatic improvements for the Andreas> performance of tasks like untar/compile of a kernel which Andreas> touch a lot of small files. During my stress tests I have occasional error messages on the OSS nodes like: [693471.980356] LustreError: 2542:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1125497780, 5s ago) req@ffff81007c75f800 x485514/t0 o401->@NET_0xac1103fd_UUID:15 lens 4168/64 ref 1 fl Rpc:/0/0 rc 0/0 [693471.985913] LustreError: 4337:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1125497780, 5s ago) req@ffff81002f7fa400 x485515/t0 o401->@NET_0xac1103fd_UUID:15 lens 4168/64 ref 1 fl Rpc:/0/0 rc 0/0 [693471.985960] LustreError: 4337:0:(recov_thread.c:396:log_commit_thread()) commit ffff810030dce000:ffff81006a657e00 drop 128 cookies: rc -110 [693472.038458] LustreError: 2542:0:(recov_thread.c:396:log_commit_thread()) commit ffff810003322000:ffff81006a657e00 drop 128 cookies: rc -110 [693472.400685] LustreError: 26226:0:(lib-move.c:162:lib_match_md()) 2886796053: Dropping PUT from 2886796285.12345 portal 16 match 0x7688a offset 0 length 64: no match Is this serious? Also, when running bonnie++, my read performance is quite low compared to write, even though the OSTs have equal read/write throughput. Do you have an idea, where this could come from? 
Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ni-01-01         8G 49056  99 87752  78 40259  98 38538  99 62344  99 231.7   4
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1392  24   950  41   907   8  1456  24   957  39  1108   9
ni-01-01,8G,49056,99,87752,78,40259,98,38538,99,62344,99,231.7,4,16,1392,24,950,41,907,8,1456,24,957,39,1108,9

Concerning performance: When running bonnie++ on a single client with no
I/O on any other client, I obtain only 100MB/s block write throughput, and
60MB/s block read throughput, even though each OST gives 200MB/s on the raw
device. I'm using Infiniband TCP/IP as interconnect, which gives me 300MB/s
max. throughput.

Thanks,

Roland
On Monday 29 August 2005 20:37, Andreas Dilger wrote:
> For nodes which are expected to have a lot of interactive use (e.g. login
> nodes) it is possible to increase the DLM LRU size for these nodes to
> reduce interactive latency. This can be done on a smallish number of
> nodes (10-20) without problems, but isn't optimal for all clients in very
> large clusters.
>
> for LRU in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
>         case $LRU in
>         */MDC*) echo 2000 > $LRU ;;
>         */OSC*) echo 1000 > $LRU ;;
>         esac
> done
>
> This tuning has shown dramatic improvements for the performance of tasks
> like untar/compile of a kernel which touch a lot of small files.

Hi, this doesn't seem to help anymore. On our lustre setup version 1.4.6.2
we see very good io performance (600MB/s using infiniband) on large files,
but small file performance is really bad. For instance untarring the linux
kernel source is 20 times slower on the lustre file system than on local
disk:

[royd@compute-1-2 c1-2]$ cd /mnt/lustre
[royd@compute-1-2 lustre]$ time tar xf /tmp/linux-2.6.9.tar

real    1m7.848s
user    0m0.203s
sys     0m32.832s

[royd@compute-1-2 lustre]$ cd /tmp
[royd@compute-1-2 tmp]$ time tar xf /tmp/linux-2.6.9.tar

real    0m2.590s
user    0m0.169s
sys     0m1.470s

I've also tried to change the max_rpcs_in_flight parameter mentioned later
in this thread, but that doesn't seem to help either.

We would like to deploy a pair of interactive nodes where we want to beef
up the small file performance. Are there any more knobs to turn?

Our setup is like this:
OSTs: 2 Nexan Satabeasts, 10TB each, in the above case 4 OSTs of 1.5TB on
each.
OSSs: 2 HP Proliant 380s (i386) with FC channels to the beasts and
infiniband to the clients. One of these is the MDS too.
100 HP rx4640s (ia64) as clients.

CentOS 4.2, with the latest errata kernel, 2.6.9-34.EL, lustre 1.4.6.2
patches and the Voltaire IBHOST stack.

Any hints are greatly appreciated.

Regards,
r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd@cc.uit.no
On Thu, 22 Jun 2006, Roy Dragseth wrote:

>> This tuning has shown dramatic improvements for the performance of tasks
>> like untar/compile of a kernel which touch a lot of small files.
>
> Hi, this doesn't seem to help anymore.

Have you tried *really* large values, i.e. on the order of the number of
files in the tarball (e.g. 30000 for the MDS LRU)? I'm not sure how much
this is advisable though. ;)

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
On Jun 22, 2006 14:34 +0200, Roy Dragseth wrote:
> On our lustre setup version 1.4.6.2 we see very good io performance
> (600MB/s using infiniband) on large files, but small file performance
> is really bad. For instance untarring the linux kernel source is 20 times
> slower on the lustre file system than on local disk:
>
> [royd@compute-1-2 c1-2]$ cd /mnt/lustre
> [royd@compute-1-2 lustre]$ time tar xf /tmp/linux-2.6.9.tar
>
> real    1m7.848s
> user    0m0.203s
> sys     0m32.832s
>
> [royd@compute-1-2 lustre]$ cd /tmp
> [royd@compute-1-2 tmp]$ time tar xf /tmp/linux-2.6.9.tar
>
> real    0m2.590s
> user    0m0.169s
> sys     0m1.470s
>
> I've also tried to change the max_rpcs_in_flight parameter mentioned
> later in this thread, but that doesn't seem to help either.

How large is your tarball? My 2.6.9-34.EL kernel is 201MB, so this is
exceeding the Lustre-imposed maximum client cache size (32 MB per OSC).

To make this a fair test, in addition to increasing the lock LRU size you
should also increase the /proc/fs/lustre/osc/*/max_dirty_mb value to, say,
128MB so that the client can cache as much of the dataset locally as
possible, and then flush it out in the background. Also, it would be
prudent to do the "local" benchmark on the OSS node mounting one of the
OST filesystems temporarily (after stopping lustre of course) so that the
same disk hardware is used.

The local filesystem isn't writing all of the tarball to disk before tar
returns, and it is likely caching all of it. Lustre does more aggressive
write flushing than local filesystems, because it is undesirable to have
many GB of outstanding writes in client cache when there are thousands of
clients. When there is a smaller number of clients doing this kind of
operation, these restrictions can be removed.

> Our setup is like this:
> OSTs: 2 Nexan Satabeasts, 10TB each, in the above case 4 OSTs of 1.5TB on
> each.
> OSSs: 2 HP Proliant 380s (i386) with FC channels to the beasts and
> infiniband to the clients. One of these is the MDS too.
> 100 HP rx4640s (ia64) as clients.

Just to confirm, what is the number of stripes per file? Having more than
a single stripe on small files is pure overhead.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
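A minimal sketch of the max_dirty_mb change suggested above, run on the
client and assuming the same /proc layout (128 is the example value from the
text):

for f in /proc/fs/lustre/osc/*/max_dirty_mb; do
        echo 128 > $f      # allow up to 128MB of dirty cache per OSC
done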
On Thursday 22 June 2006 23:36, Andreas Dilger wrote:
> On Jun 22, 2006 14:34 +0200, Roy Dragseth wrote:
> > On our lustre setup version 1.4.6.2 we see very good io performance
> > (600MB/s using infiniband) on large files, but small file performance
> > is really bad. For instance untarring the linux kernel source is 20
> > times slower on the lustre file system than on local disk:
> >
> > [royd@compute-1-2 c1-2]$ cd /mnt/lustre
> > [royd@compute-1-2 lustre]$ time tar xf /tmp/linux-2.6.9.tar
> >
> > real    1m7.848s
> > user    0m0.203s
> > sys     0m32.832s
> >
> > [royd@compute-1-2 lustre]$ cd /tmp
> > [royd@compute-1-2 tmp]$ time tar xf /tmp/linux-2.6.9.tar
> >
> > real    0m2.590s
> > user    0m0.169s
> > sys     0m1.470s
> >
> > I've also tried to change the max_rpcs_in_flight parameter mentioned
> > later in this thread, but that doesn't seem to help either.
>
> How large is your tarball? My 2.6.9-34.EL kernel is 201MB, so this is
> exceeding the Lustre-imposed maximum client cache size (32 MB per OSC).

# ll -h /tmp/linux-2.6.9.tar
-rw-r--r--  1 root root 196M Jun 23 08:55 /tmp/linux-2.6.9.tar

> To make this a fair test, in addition to increasing the lock LRU size you
> should also increase the /proc/fs/lustre/osc/*/max_dirty_mb value to,
> say, 128MB so that the client can cache as much of the dataset locally
> as possible, and then flush it out in the background. Also, it would
> be prudent to do the "local" benchmark on the OSS node mounting one
> of the OST filesystems temporarily (after stopping lustre of course)
> so that the same disk hardware is used.

I mounted one of the devices as /lshared1 and reran on the OSS; it shows
an unpacking time about the same as before:

# cd /lshared1
# time tar xf /tmp/linux-2.6.9.tar

real    0m2.557s
user    0m0.170s
sys     0m1.677s

Increasing max_dirty_mb didn't help either:

# cat /proc/fs/lustre/osc/*/max_dirty_mb
256
256
256
256
256
256
256
256

$ cd /mnt/lustre
$ time tar xf /tmp/linux-2.6.9.tar

real    0m42.357s
user    0m0.199s
sys     0m21.379s

> The local filesystem isn't writing all of the tarball to disk before
> tar returns, and it is likely caching all of it. Lustre does more
> aggressive write flushing than local filesystems, because it is
> undesirable to have many GB of outstanding writes in client cache when
> there are thousands of clients. When there is a smaller number of
> clients doing this kind of operation, these restrictions can be removed.

Yes, but even if I include the sync time it still runs around 10 times
faster against local disk than over lustre:

# time bash -c "tar xf /tmp/linux-2.6.9.tar ; sync"

real    0m4.363s
user    0m0.168s
sys     0m1.933s

> > Our setup is like this:
> > OSTs: 2 Nexan Satabeasts, 10TB each, in the above case 4 OSTs of
> > 1.5TB on each.
> > OSSs: 2 HP Proliant 380s (i386) with FC channels to the beasts and
> > infiniband to the clients. One of these is the MDS too.
> > 100 HP rx4640s (ia64) as clients.
>
> Just to confirm, what is the number of stripes per file? Having more
> than a single stripe on small files is pure overhead.

The default is one stripe per file:

lmc -m $CONFIG --add lov --lov lov-work --mds mds-work --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0

We run the MDS filesystem on the same arrays that host the OSTs, but
moving the MDS to a ramfs, e.g. /dev/shm, doesn't seem to affect
performance at all.

But it seems to me like the overhead is in the file creation, as this
little experiment shows: first we collect the directory structure in the
linux tarball, then all the filenames.
# cd /tmp
# tar xf /tmp/linux-2.6.9.tar
# find linux-2.6.9 -type d > /tmp/linuxsrcdirs.txt
# find linux-2.6.9 -type f > /tmp/linuxsrcfiles.txt

Creating the dir structure is really fast, creating the files is really
slow:

# cd /mnt/lustre
# time bash -c "cat /tmp/linuxsrcdirs.txt | xargs mkdir"

real    0m0.651s
user    0m0.005s
sys     0m0.260s

# time bash -c "cat /tmp/linuxsrcfiles.txt | xargs touch"

real    0m32.470s
user    0m0.135s
sys     0m16.546s

So, in this case 32 of the 42 seconds seems to be spent in creating the
files.

Attached you'll find the script used to create the filesystem; maybe it is
something obvious I'm doing wrong?

Regards,
r.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: createLustreStripe.sh.zip
Type: application/x-zip
Size: 1018 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060623/4e5bced5/createLustreStripe.sh-0001.bin
Roy Dragseth wrote:
> # cd /tmp
> # tar xf /tmp/linux-2.6.9.tar
> # find linux-2.6.9 -type d > /tmp/linuxsrcdirs.txt
> # find linux-2.6.9 -type f > /tmp/linuxsrcfiles.txt
>
> Creating the dir structure is really fast, creating the files is really
> slow:
>
> # cd /mnt/lustre
> # time bash -c "cat /tmp/linuxsrcdirs.txt | xargs mkdir"
> real    0m0.651s
> user    0m0.005s
> sys     0m0.260s
>
> # time bash -c "cat /tmp/linuxsrcfiles.txt | xargs touch"
> real    0m32.470s
> user    0m0.135s
> sys     0m16.546s
>
> So, in this case 32 of the 42 seconds seems to be spent in creating the
> files.

One problem with this test is that there is over an order of magnitude
more files than directories in the linux source tree. While I don't
contend that this will shed more light on your problem, it would be better
to do an apples-to-apples test.

--
David Vasil <dmvasil@ornl.gov>
Oak Ridge National Laboratory, NCCS Division
High Performance Computing Systems Administrator
On Friday 23 June 2006 14:12, David Vasil wrote:
> One problem with this test is that there is over an order of magnitude
> more files than directories in the linux source tree. While I don't
> contend that this will shed more light on your problem, it would be
> better to do an apples-to-apples test.

It was intended more as a breakdown of events concerning the metadata than
a comparison between the speed of directory creation and file creation.

r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd@cc.uit.no
Hi Roy,

Boy, it's a pleasure to work through this with you. IIRC we _should_ be
able to do well on unpacking, but it has been a while. However, until we
have our metadata writeback cache (2008?), we cannot win over a local file
system. In a local file system the cache flushes write the newly created
files to disk in huge batches.

Let's first look at an order of magnitude issue here. I suspect you have
about 15,000 files perhaps? So the creation data below shows that you
create about 500/sec. On big machines we see up to ~14,000 creates /
second, on smaller systems maybe a few 1000. If we assume 2000, that means
you'll still be sitting in file creations for 7.5 seconds. What kind of an
MDS do you have?

First, let's turn debugging off completely, on all nodes please:

echo 0 > /proc/sys/portals/debug

Then try again.

The files have objects on the OST which are supposed to be pre-created,
and it shouldn't interfere too much with performance. Normal file creation
performance is on par with directory creation. But something there is
clearly awry.

Could you see what "create_count" is set to in the OSC
/proc/fs/lustre/osc/*/... directories? Putting something larger in there
may help.

Let me also ask you, how wide are you striping the files (lfs getstripe
<unpackdir>)? For best performance you want to set the stripe count on a
subdirectory to 1:

lfs setstripe <unpack-dir> 4194304 -1 1

Try these (one at a time please) and let us know how you are doing. It may
unfortunately take a few more iterations to get this right.

- Peter -

> -----Original Message-----
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Roy Dragseth
> Sent: Friday, June 23, 2006 3:10 AM
> To: lustre-discuss@clusterfs.com
> Subject: Re: [Lustre-discuss] I/O performance on small files
>
> [...]
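Taken together, Peter's three suggestions might look like the following
client-side sketch (the value 128 for create_count is illustrative, and
repeating the debug setting on the MDS/OSS nodes is left to whatever remote
shell is at hand):

# 1. turn off Lustre/Portals debugging (repeat on the MDS and OSS nodes)
echo 0 > /proc/sys/portals/debug

# 2. inspect, and optionally raise, the object pre-creation batch size
cat /proc/fs/lustre/osc/*/create_count
for f in /proc/fs/lustre/osc/*/create_count; do echo 128 > $f; done

# 3. single-stripe the unpack directory: 4MB stripe size,
#    default starting OST (-1), stripe count 1
lfs setstripe /mnt/lustre/unpack-dir 4194304 -1 1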
Roy,

Besides kernel untar, I'd suggest the "fileop" metadata benchmark, which
gives operations/second for various metadata operations. It's part of the
IOzone benchmark suite. Sample output is given below.

/mnt/scratch_256/fileop 60

 --------------------------------------
 |              Fileop                 |
 |        $Revision: 1.20 $            |
 |                                     |
 |                 by                  |
 |                                     |
 |             Don Capps               |
 --------------------------------------

mkdir:   Dirs =    3660 Total Time = 0.281478405 seconds
         Avg mkdir(s)/sec   =    13002.77 ( 0.000076907 seconds/op)
         Best mkdir(s)/sec  =    72315.59 ( 0.000013828 seconds/op)
         Worst mkdir(s)/sec =       30.96 ( 0.032297134 seconds/op)

rmdir:   Dirs =    3660 Total Time = 0.320229292 seconds
         Avg rmdir(s)/sec   =    11429.31 ( 0.000087494 seconds/op)
         Best rmdir(s)/sec  =    67650.06 ( 0.000014782 seconds/op)
         Worst rmdir(s)/sec =      163.94 ( 0.006099939 seconds/op)

create:  Files =  216000 Total Time = 32.995336533 seconds
         Avg create(s)/sec   =     6546.38 ( 0.000152756 seconds/op)
         Best create(s)/sec  =    72315.59 ( 0.000013828 seconds/op)
         Worst create(s)/sec =        0.55 ( 1.808913946 seconds/op)

write:   Files =  216000 Total Time = 1.749092102 seconds
         Avg write(s)/sec   =   123492.64 ( 0.000008098 seconds/op)
         Best write(s)/sec  =   167772.16 ( 0.000005960 seconds/op)
         Worst write(s)/sec =      495.55 ( 0.002017975 seconds/op)

close:   Files =  216000 Total Time = 38.892731667 seconds
         Avg close(s)/sec   =     5553.74 ( 0.000180059 seconds/op)
         Best close(s)/sec  =   209715.20 ( 0.000004768 seconds/op)
         Worst close(s)/sec =        0.38 ( 2.653007030 seconds/op)

stat:    Files =  216000 Total Time = 0.476946831 seconds
         Avg stat(s)/sec   =   452880.67 ( 0.000002208 seconds/op)
         Best stat(s)/sec  =  1048576.00 ( 0.000000954 seconds/op)
         Worst stat(s)/sec =    16644.06 ( 0.000060081 seconds/op)

access:  Files =  216000 Total Time = 0.510522604 seconds
         Avg access(s)/sec   =   423095.86 ( 0.000002364 seconds/op)
         Best access(s)/sec  =  1048576.00 ( 0.000000954 seconds/op)
         Worst access(s)/sec =     1709.17 ( 0.000585079 seconds/op)

chmod:   Files =  216000 Total Time = 17.839460373 seconds
         Avg chmod(s)/sec   =    12107.99 ( 0.000082590 seconds/op)
         Best chmod(s)/sec  =   262144.00 ( 0.000003815 seconds/op)
         Worst chmod(s)/sec =        0.45 ( 2.227634907 seconds/op)

readdir: Files =    3600 Total Time = 0.052351236 seconds
         Avg readdir(s)/sec   =    68766.28 ( 0.000014542 seconds/op)
         Best readdir(s)/sec  =    83886.08 ( 0.000011921 seconds/op)
         Worst readdir(s)/sec =    11491.24 ( 0.000087023 seconds/op)

link:    Files =  216000 Total Time = 42.742334366 seconds
         Avg link(s)/sec   =     5053.54 ( 0.000197881 seconds/op)
         Best link(s)/sec  =    91180.52 ( 0.000010967 seconds/op)
         Worst link(s)/sec =        0.50 ( 1.982422113 seconds/op)

unlink:  Files =  216000 Total Time = 31.314268112 seconds
         Avg unlink(s)/sec   =     6897.81 ( 0.000144973 seconds/op)
         Best unlink(s)/sec  =   113359.57 ( 0.000008821 seconds/op)
         Worst unlink(s)/sec =        0.21 ( 4.793473005 seconds/op)

delete:  Files =  216000 Total Time = 55.935692072 seconds
         Avg delete(s)/sec   =     3861.58 ( 0.000258962 seconds/op)
         Best delete(s)/sec  =    29537.35 ( 0.000033855 seconds/op)
         Worst delete(s)/sec =        0.65 ( 1.544568062 seconds/op)

>>> Roy Dragseth <Roy.Dragseth@cc.uit.no> 6/23/2006 6:40 AM >>>
[...]
Kumaran,

One might want to pick up the latest version from the IOzone web site.
The current version is Revision 1.37. The new version has many new nifty
features:

 --------------------------------------
 |              Fileop                 |
 |        $Revision: 1.37 $            |
 |                                     |
 |                 by                  |
 |                                     |
 |             Don Capps               |
 --------------------------------------

fileop [-f # ] [-l # -u #] [-s Y] [-t] [-v] [-e] [-b] [-w]

     -f #  Force factor. X^3 files will be created and removed.
     -l #  Lower limit on the value of the Force factor. (optional)
     -u #  Upper limit on the value of the Force factor. (optional)
     -s #  Sets filesize for the create/write. (optional)
     -t    Verbose output option. (optional)
     -v    Version information. (optional)
     -e    Excel importable format. (optional)
     -b    Output best case results. (optional)
     -w    Output worst case results. (optional)

The structure of the file tree is: X number of Level 1 directories, with X
number of level 2 directories, with X number of files in each of the level
2 directories.

Example: fileop 2

                dir_1                           dir_2
               /     \                         /     \
          sdir_1      sdir_2             sdir_1      sdir_2
          /    \      /    \             /    \      /    \
     file_1 file_2 file_1 file_2    file_1 file_2 file_1 file_2

Each file will be created, and then Y bytes is written to the file.

Enjoy,
Don Capps

----- Original Message -----
From: "Kumaran Rajaram" <krajaram@lnxi.com>
To: "Roy Dragseth" <Roy.Dragseth@cc.uit.no>; <lustre-discuss@clusterfs.com>
Sent: Friday, June 23, 2006 10:35 AM
Subject: Re: [Lustre-discuss] I/O performance on small files

> [...]
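Going by the usage text above, a small run against the Lustre mount might
look like the following sketch (the force factor 10 gives 10^3 = 1000 files;
the -s value is illustrative, and its units are not stated in the usage
text):

cd /mnt/lustre
fileop -f 10 -s 1024 -t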
On Friday 23 June 2006 15:15, Peter J. Braam wrote:
> Hi Roy,
>
> Boy, it's a pleasure to work through this with you. IIRC we _should_ be
> able to do well on unpacking, but it has been a while. However, until
> we have our metadata writeback cache (2008?), we cannot win over a local
> file system. In a local file system the cache flushes write the newly
> created files to disk in huge batches.

Just to make things clear, I'm not whining, I just want to know if I've
done something stupid ;-) We're planning to deploy lustre both for a
scratch file area (huge files, lots of io) and a home area (not so much
io, but possibly lots of tar unpacks and compiles).

Just to make a comparison with our current NFS setup I ran the same tests,
and it turns out that lustre isn't that much worse in this case than nfs
(although the nfs system was busy during the tests). The file creation
test took 27 secs on nfs in comparison to 32 secs on lustre:

$ cd ~/tmp
$ time bash -c "cat /tmp/linuxsrcdirs.txt | xargs mkdir"

real    0m0.619s
user    0m0.008s
sys     0m0.083s

$ time bash -c "cat /tmp/linuxsrcfiles.txt | xargs touch"

real    0m27.775s
user    0m0.107s
sys     0m2.298s

$ rm -rf linux-2.6.9/
$ time tar xf /tmp/linux-2.6.9.tar

real    0m34.403s
user    0m0.175s
sys     0m4.133s

> Let's first look at an order of magnitude issue here. I suspect you
> have about 15,000 files perhaps?

Yup:

$ wc -l /tmp/linuxsrcfiles.txt
16448 /tmp/linuxsrcfiles.txt

> So the creation data below shows that you create about 500/sec. On big
> machines we see up to ~14,000 creates / second, on smaller systems
> maybe a few 1000. If we assume 2000, that means you'll still be sitting
> in file creations for 7.5 seconds. What kind of an MDS do you have?

The MDS is running on a dual cpu 3.4GHz/2GB RAM HP Proliant DL380; this
machine is also serving as one of two OSSs. We have two of these, and the
plan is to run one MDS on each of them (also using them as OSSs) with
failover. The MDS storage is placed on the same storage as the OSTs, but
moving it to another file system doesn't seem to matter.

> First, let's turn debugging off completely, on all nodes please:
>
> echo 0 > /proc/sys/portals/debug

Hey, hey, now we're talking! Turning off debugging on both the client and
the MDSs and OSSs brings the time down from 32 to 14 secs:

$ cd /mnt/lustre
$ rm -rf linux-2.6.9/
$ time bash -c "cat /tmp/linuxsrcdirs.txt | xargs mkdir"

real    0m0.301s
user    0m0.009s
sys     0m0.064s

$ time bash -c "cat /tmp/linuxsrcfiles.txt | xargs touch"

real    0m14.346s
user    0m0.107s
sys     0m2.989s

(Still slapping my forehead for not thinking of this one...)

> Then try again.
>
> The files have objects on the OST which are supposed to be pre-created,
> and it shouldn't interfere too much with performance. Normal file
> creation performance is on par with directory creation. But something
> there is clearly awry.
>
> Could you see what "create_count" is set to in the OSC
> /proc/fs/lustre/osc/*/... directories? Putting something larger in
> there may help.

$ cat /proc/fs/lustre/osc/*/create_count
32
32
32
32
32
32
32
32

Increasing these numbers doesn't give any change. Nor do the changes
suggested earlier in this thread. I've tried the following:

/proc/fs/lustre/osc/*/create_count = 128
/proc/fs/lustre/osc/*/max_rpcs_in_flight = 32
/proc/fs/lustre/ldlm/namespaces/MDC*/lru_size = 2000
/proc/fs/lustre/ldlm/namespaces/OSC*/lru_size = 1000

> Let me also ask you, how wide are you striping the files (lfs getstripe
> <unpackdir>)?
> For best performance you want to set the stripe count on a
> subdirectory to 1:
>
> lfs setstripe <unpack-dir> 4194304 -1 1

The default stripe count on the file system is 1 and the stripe size is
1MB:

lmc -m $CONFIG --add lov --lov lov-work --mds mds-work --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0

so that should be ok.

> Try these (one at a time please) and let us know how you are doing. It
> may unfortunately take a few more iterations to get this right.

Turning off debugging gives us a significant increase in performance over
our current NFS setup. With this setting we see a file creation rate of
around 1000 per sec; do you suggest that this could be further increased?
I'll be happy to test any ideas you might have.

We are now seeing a significant performance increase over our current nfs
setup: IO for large files has increased from 50MB/s to >600MB/s, and the
file creation rate has doubled. Besides, the lustre system doesn't seem to
fall over and die as soon as a few clients start hitting it, as our
current nfs setup does. This was an important design goal :-)

Best regards and have a nice weekend,
r.
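For reference, the client-side knobs discussed in this thread, collected
into one sketch (the values are the ones quoted above; the loop form is just
one way of applying them, not necessarily how Roy did it):

echo 0 > /proc/sys/portals/debug
for f in /proc/fs/lustre/osc/*/create_count; do echo 128 > $f; done
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do echo 32 > $f; done
for f in /proc/fs/lustre/ldlm/namespaces/MDC*/lru_size; do echo 2000 > $f; done
for f in /proc/fs/lustre/ldlm/namespaces/OSC*/lru_size; do echo 1000 > $f; done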
Hi Roy,

Can this be faster? Certainly running an OST on the MDS is not going to
help, but I think you have the bulk of the performance now. More tuning is
possible when you have multiple systems using the MDS simultaneously -
Andreas pointed out that the higher numbers I mentioned can absolutely
only be achieved with more than one client.

In a year or two we will have a writeback cache - at that point we hope to
get closer to the local file system situation.

Best wishes,

- Peter -