Hello,

My Lustre setup is as follows (see attached setup file):

- Two MDS servers in active/passive failover
- Four OSS servers in active/active failover
- Mellanox InfiniBand interconnect
- SATA2-FC SAN storage

All servers are dual AMD Opteron 248 machines with 4GB RAM running RHEL4 U4 and Lustre 1.4.7.3 (RPM version). The MDS servers are attached to a 16-disk SATA2-FC4G RAID configured as RAID10 with 4TB capacity. The OSS servers are attached in pairs to two 24-disk SATA2-FC2G RAIDs, configured with redundant paths and controllers and with two RAID5 volumes, each providing 5TB capacity. All servers use QLogic FC controllers, a QLogic SANBlade QLE2460 (MDS) and a QLogic SANBlade QLE2462 (OSS), with the qla2xxx-8.01.06 driver. All other hardware drivers are standard Red Hat.

I mount this Lustre filesystem over InfiniBand on a quad AMD Opteron 875 with 16GB RAM and run some simple tests in comparison with direct-attached storage: a 16-disk 1TB FC-FC2G RAID system with 8 disks as RAID5. The server with the attached storage is a dual AMD Opteron 248 with 8GB RAM.

First test: Download nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz (945MB) and run "time gzip -d nr.gz"

Lustre:
real    0m39.343s
user    0m27.603s
sys     0m11.716s

Attached storage:
real    0m43.049s
user    0m28.692s
sys     0m8.630s

Second test: Using the BLAST 2.2.15 formatdb command on the nr (2155MB) database: "time formatdb -i nr"

Lustre:
real    12m42.931s
user    3m29.847s
sys     9m9.381s

Attached storage:
real    4m6.323s
user    3m21.581s
sys     0m43.857s

Why is the attached storage in this case three times faster than the Lustre filesystem? Comparing the two "time" outputs, the "sys" part is much higher on Lustre than on the attached storage. Any suggestions as to what I did wrong?

Many thanks,
Jan

Jan Taubert
Biomathematics and Bioinformatics Division
Rothamsted Research
West Common, Harpenden, Hertfordshire. AL5 2JQ, UK
tel: 01582 763133 ext 2108
fax: 01582 760981
email: jan.taubert@bbsrc.ac.uk

Rothamsted Research is a company limited by guarantee, registered in England under the registration number 2393175 and a not for profit charity number 802038.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: setup.sh
Type: application/octet-stream
Size: 1897 bytes
Desc: setup.sh
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061117/f6d4be85/setup.obj
On Fri, 17 Nov 2006, jan taubert (RRes-Roth) wrote:

> Second test: Using the BLAST 2.2.15 formatdb command on the nr (2155MB)
> database: "time formatdb -i nr"
>
> Lustre:
> real    12m42.931s
> user    3m29.847s
> sys     9m9.381s
>
> Attached storage:
> real    4m6.323s
> user    3m21.581s
> sys     0m43.857s
>
> Why is the attached storage in this case three times faster than the
> Lustre filesystem? Comparing the two "time" outputs, the "sys" part is
> much higher on Lustre than on the attached storage. Any suggestions as
> to what I did wrong?

You did not necessarily do anything wrong, but some workloads require additional tuning, and in some cases local storage can be faster than Lustre.

I don't know the program you used at all, but it sounds like it could be performing lots of small I/Os and/or operating on many small files. The output of "strace -fc formatdb ..." (maybe with a smaller working set) would certainly give us hints, and the full output of "strace -fttT ..." can be useful as well (unless this program uses mmap() for I/Os).

If the program operates on large files, then you should probably look at striping parameters (e.g. stripe over all OSTs, and try different stripe sizes).

HTH

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Many thanks for your reply.

I did as you suggested and attached the two strace outputs for a smaller file.

Just a quick explanation of what formatdb does: it is part of the BLAST suite, which is the most commonly used bioinformatics utility for the analysis of gene or protein sequences. The BLAST suite used is the binary distribution provided by NCBI (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml). Formatdb takes a file of gene or protein sequences and converts it to a format used by the BLAST application.

Would it be possible to tune Lustre for this kind of workload? I saw that formatdb tries to perform mmap operations first and then executes alternating write and lseek operations.

My Lustre setup script is attached.

Jan

-------------- next part --------------
Process 18117 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.00    0.622516          73      8509           write
  4.49    0.030022           7      4244           lseek
  0.89    0.005988         193        31        13 open
  0.42    0.002826         217        13           read
  0.41    0.002766         198        14        13 stat
  0.30    0.001995        1995         1           execve
  0.23    0.001512          84        18           fstat
  0.14    0.000934          44        21           close
  0.04    0.000300          21        14           munmap
  0.03    0.000178         178         1           poll
  0.02    0.000148           5        29           mmap
  0.00    0.000029           5         6           mprotect
  0.00    0.000029          10         3           socket
  0.00    0.000023          12         2         2 connect
  0.00    0.000023          23         1           sendto
  0.00    0.000014           2         7           fcntl
  0.00    0.000011           4         3           time
  0.00    0.000009           9         1           pread
  0.00    0.000008           4         2           getcwd
  0.00    0.000007           2         3           brk
  0.00    0.000007           7         1         1 access
  0.00    0.000006           3         2           uname
  0.00    0.000004           4         1           setsockopt
  0.00    0.000003           3         1           ioctl
  0.00    0.000003           3         1           recvfrom
  0.00    0.000003           3         1         1 bind
  0.00    0.000003           3         1           gettimeofday
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000002           2         1           getpid
  0.00    0.000002           2         1           getuid
------ ----------- ----------- --------- --------- ----------------
100.00    0.669374                 12934        30 total
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace-fttT.zip
Type: application/x-zip-compressed
Size: 234854 bytes
Desc: strace-fttT.zip
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061120/f042c225/strace-fttT-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: setup.sh
Type: application/octet-stream
Size: 1897 bytes
Desc: setup.sh
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061120/f042c225/setup-0001.obj
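The syscall summary above (8509 write() calls averaging well under 100 bytes, plus 4244 lseek() calls) points at a record-at-a-time output loop. The following is a purely illustrative C sketch of that kind of pattern, not formatdb's actual code; the file name, record size and loop count are made up to roughly match the strace counts:

    /* Illustrative only: a record-at-a-time output loop of the kind the
     * strace counts suggest. Each record costs one lseek() and one small
     * write(), i.e. two syscalls for a few dozen bytes of data. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static void write_record(int fd, off_t offset, const void *rec, size_t len)
    {
        lseek(fd, offset, SEEK_SET);   /* position for this record */
        write(fd, rec, len);           /* tiny write, ~83 bytes     */
    }

    int main(void)
    {
        int fd = open("example.phr", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char rec[83];                  /* average write size seen in strace */
        memset(rec, 'x', sizeof(rec));

        off_t off = 0;
        for (int i = 0; i < 8488; i++) {   /* roughly the observed call count */
            write_record(fd, off, rec, sizeof(rec));
            off += sizeof(rec);
        }
        close(fd);
        return 0;
    }

Each record costs two syscalls, and on a Lustre client every write() also has to pass through the network filesystem layer, which fits the much larger "sys" time reported earlier in the thread.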
On Mon, 20 Nov 2006, jan taubert (RRes-Roth) wrote:

> Many thanks for your reply.
>
> I did as you suggested and attached the two strace outputs for a
> smaller file.

That's interesting:

- the program does not mmap() its output files, so all I/Os are performed through syscalls, and thus strace shows all of them

- "strace -c" shows that most of the time is spent performing writes:

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   93.00    0.622516          73      8509           write

- an analysis of "strace -fttT" shows that two files receive almost all writes, which are *very* small on average:

  time (sec)      read    written  avg.read  avg.write    ops  file
       0.375         0     355090        -1         83   8488  ncbEC.phr
       0.323         0    1355565        -1        319   4247  ncbEC.psq

From here you have several options to improve performance:

- make larger writes: it could be a matter of command line options, input files, build options, or maybe the code needs patching (for example, use buffered I/Os with large buffers, if it's possible); this will probably get you the best performance improvement

- you can also set sensible striping options on Lustre: I would suggest striping over all OSTs with the smallest stripe size possible (64KB); this can be accomplished with lfs, e.g.:

  $ lfs setstripe <directory> 65536 -1 -1

> Formatdb takes a file of gene or protein sequences and converts it to
> a format used by the BLAST application.

If formatdb is only used to initialise a file that will then be used most of the time by another program, then striping should probably be optimized for the latter.

Cheers,

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
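To make the "buffered I/Os with large buffers" suggestion a bit more concrete, here is a minimal sketch, again not formatdb code and with made-up file name and sizes, of how a large stdio buffer collapses thousands of tiny write() syscalls into a handful of large ones:

    /* Sketch of the buffering idea: the same small records, but pushed
     * through stdio with a 1MB user-space buffer. The kernel (and the
     * Lustre client) then sees a few large write() calls instead of
     * thousands of ~83-byte ones. Names and sizes are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (1024 * 1024)     /* 1MB stdio buffer */

    int main(void)
    {
        FILE *out = fopen("example.phr", "wb");
        if (!out)
            return 1;

        char *buf = malloc(BUF_SIZE);
        setvbuf(out, buf, _IOFBF, BUF_SIZE);   /* fully buffered, large buffer */

        char rec[83];
        memset(rec, 'x', sizeof(rec));

        for (int i = 0; i < 8488; i++)
            fwrite(rec, 1, sizeof(rec), out);  /* no syscall per record */

        fclose(out);                           /* flushes the final buffer */
        free(buf);
        return 0;
    }

With a 1MB buffer, the ~8500 records from the strace summary would reach the kernel in only a handful of write() calls, which is the kind of change that should shrink the "sys" component seen on Lustre.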
On Mon, 20 Nov 2006, Jean-Marc Saffroy wrote:

> - you can also set sensible striping options on Lustre: I would suggest
>   striping over all OSTs with the smallest stripe size possible (64KB);
>   this can be accomplished with lfs, e.g.:
>
>   $ lfs setstripe <directory> 65536 -1 -1

Actually, in your Lustre configuration script you already set striping over all OSTs. A small stripe size could still help, but having formatdb do bigger writes is certainly the best option (if it's possible).

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Modifying the program will be problematic, because we try to stick with pretty standard software and keep it up to date with new releases. I also doubt that there is anyone around in my research institute who would be able to modify the BLAST program, including me.

I'm trying the smaller stripe size now, and I was wondering what impact the stripe size of the RAID systems attached to the OSSs may have, and whether there is also an impact from the packet size of the FC layer between OSS and RAID.

Jan
On Tue, 21 Nov 2006, jan taubert (RRes-Roth) wrote:

> Modifying the program will be problematic, because we try to stick with
> pretty standard software and keep it up to date with new releases. I
> also doubt that there is anyone around in my research institute who
> would be able to modify the BLAST program, including me.

Maybe the BLAST maintainers would be open to suggestions? The write pattern in formatdb is pretty inefficient.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net