Hello,

My Lustre setup is as follows (see attached setup file):

- Two MDS servers in active/passive failover
- Four OSS servers in active/active failover
- Mellanox InfiniBand interconnect
- SATA2-FC SAN storage

All servers are dual AMD Opteron 248 machines with 4GB RAM running RHEL4 U4 and Lustre 1.4.7.3 (RPM version). The MDS servers are attached to a 16-disk SATA2-FC4G RAID configured as RAID10 with 4TB capacity. The OSS servers are attached in pairs to two 24-disk SATA2-FC2G RAIDs, configured with redundant paths and controllers and with two RAID5 volumes, each providing 5TB capacity. All servers use QLogic FC controllers, a QLogic SANBlade QLE2460 (MDS) and a QLogic SANBlade QLE2462 (OSS), with the qla2xxx-8.01.06 driver. All other hardware drivers are standard Red Hat.

I mount this Lustre filesystem over InfiniBand on a quad AMD Opteron 875 with 16GB RAM and run some simple tests in comparison with direct-attached storage: a 16-disk 1TB FC-FC2G RAID system with 8 disks as RAID5. The server with the attached storage is a dual AMD Opteron 248 with 8GB RAM.

First test: Download nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz (945MB) and run "time gzip -d nr.gz"

Lustre:
real    0m39.343s
user    0m27.603s
sys     0m11.716s

Attached storage:
real    0m43.049s
user    0m28.692s
sys     0m8.630s

Second test: Using the BLAST 2.2.15 formatdb command on the nr (2155MB) database: "time formatdb -i nr"

Lustre:
real    12m42.931s
user    3m29.847s
sys     9m9.381s

Attached storage:
real    4m6.323s
user    3m21.581s
sys     0m43.857s

Why is the attached storage in this case three times faster than the Lustre filesystem? Comparing the two "time" outputs, the "sys" part is much higher on Lustre than on the attached storage. Any suggestions as to what I did wrong?

Many thanks,
Jan

Jan Taubert
Biomathematics and Bioinformatics Division
Rothamsted Research
West Common, Harpenden, Hertfordshire. AL5 2JQ, UK
tel: 01582 763133 ext 2108
fax: 01582 760981
email: jan.taubert@bbsrc.ac.uk

Rothamsted Research is a company limited by guarantee, registered in England under the registration number 2393175 and a not for profit charity number 802038.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: setup.sh
Type: application/octet-stream
Size: 1897 bytes
Desc: setup.sh
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061117/f6d4be85/setup.obj
On Fri, 17 Nov 2006, jan taubert (RRes-Roth) wrote:

> Second test: Using the BLAST 2.2.15 formatdb command on the nr (2155MB)
> database: "time formatdb -i nr"
>
> Lustre:
> real    12m42.931s
> user    3m29.847s
> sys     9m9.381s
>
> Attached storage:
> real    4m6.323s
> user    3m21.581s
> sys     0m43.857s
>
> Why is the attached storage in this case three times faster than the
> Lustre filesystem? Comparing the two "time" outputs, the "sys" part is
> much higher on Lustre than on the attached storage. Any suggestions as
> to what I did wrong?

You did not necessarily do anything wrong, but some workloads require additional tuning, and in some cases local storage can be faster than Lustre.

I don't know the program you used at all, but it sounds like it could be performing lots of small I/Os and/or operating on many small files. The output of "strace -fc formatdb ..." (maybe with a smaller working set) would certainly give us hints, and the full output of "strace -fttT ..." can be useful as well (unless this program uses mmap() for I/Os).

If the program operates on large files, then you should probably look at striping parameters (e.g. stripe over all OSTs, and try different stripe sizes).

HTH

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Many thanks for your reply.

I did as you suggested and attached the two strace outputs for a smaller file.

Just a quick explanation of what formatdb does: it is part of the BLAST suite, which is the most commonly used bioinformatics utility for the analysis of gene or protein sequences. The BLAST suite used is the binary distribution provided by NCBI (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml). Formatdb takes a file of gene or protein sequences and converts it to a format used by the BLAST application.

Would it be possible to tune Lustre for this kind of workload? I saw that formatdb tries to perform mmap operations first and then executes alternating write and lseek operations.

My Lustre setup script is attached.

Jan

-------------- next part --------------
Process 18117 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.00    0.622516          73      8509           write
  4.49    0.030022           7      4244           lseek
  0.89    0.005988         193        31        13 open
  0.42    0.002826         217        13           read
  0.41    0.002766         198        14        13 stat
  0.30    0.001995        1995         1           execve
  0.23    0.001512          84        18           fstat
  0.14    0.000934          44        21           close
  0.04    0.000300          21        14           munmap
  0.03    0.000178         178         1           poll
  0.02    0.000148           5        29           mmap
  0.00    0.000029           5         6           mprotect
  0.00    0.000029          10         3           socket
  0.00    0.000023          12         2         2 connect
  0.00    0.000023          23         1           sendto
  0.00    0.000014           2         7           fcntl
  0.00    0.000011           4         3           time
  0.00    0.000009           9         1           pread
  0.00    0.000008           4         2           getcwd
  0.00    0.000007           2         3           brk
  0.00    0.000007           7         1         1 access
  0.00    0.000006           3         2           uname
  0.00    0.000004           4         1           setsockopt
  0.00    0.000003           3         1           ioctl
  0.00    0.000003           3         1           recvfrom
  0.00    0.000003           3         1         1 bind
  0.00    0.000003           3         1           gettimeofday
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000002           2         1           getpid
  0.00    0.000002           2         1           getuid
------ ----------- ----------- --------- --------- ----------------
100.00    0.669374                 12934        30 total
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace-fttT.zip
Type: application/x-zip-compressed
Size: 234854 bytes
Desc: strace-fttT.zip
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061120/f042c225/strace-fttT-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: setup.sh
Type: application/octet-stream
Size: 1897 bytes
Desc: setup.sh
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061120/f042c225/setup-0001.obj
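The syscall summary above (8509 write() calls averaging well under 100 bytes, plus 4244 lseek() calls) points at a record-at-a-time output loop. The following is a purely illustrative C sketch of that kind of pattern, not formatdb's actual code; the file name, record size and loop count are made up to roughly match the strace counts:

    /* Illustrative only: a record-at-a-time output loop of the kind the
     * strace counts suggest. Each record costs one lseek() and one small
     * write(), i.e. two syscalls for a few dozen bytes of data. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static void write_record(int fd, off_t offset, const void *rec, size_t len)
    {
        lseek(fd, offset, SEEK_SET);   /* position for this record */
        write(fd, rec, len);           /* tiny write, ~83 bytes     */
    }

    int main(void)
    {
        int fd = open("example.phr", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char rec[83];                  /* average write size seen in strace */
        memset(rec, 'x', sizeof(rec));

        off_t off = 0;
        for (int i = 0; i < 8488; i++) {   /* roughly the observed call count */
            write_record(fd, off, rec, sizeof(rec));
            off += sizeof(rec);
        }
        close(fd);
        return 0;
    }

Each record costs two syscalls, and on a Lustre client every write() also has to pass through the network filesystem layer, which fits the much larger "sys" time reported earlier in the thread.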
On Mon, 20 Nov 2006, jan taubert (RRes-Roth) wrote:

> Many thanks for your reply.
>
> I did as you suggested and attached the two strace outputs for a
> smaller file.

That's interesting:

- the program does not mmap() its output files, so all I/Os are performed through syscalls, and thus strace shows all of them

- "strace -c" shows that most of the time is spent performing writes:

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   93.00    0.622516          73      8509           write

- an analysis of "strace -fttT" shows that two files receive almost all writes, which are *very* small on average:

  time (sec)      read    written  avg.read  avg.write    ops  file
       0.375         0     355090        -1         83   8488  ncbEC.phr
       0.323         0    1355565        -1        319   4247  ncbEC.psq

From here you have several options to improve performance:

- make larger writes: it could be a matter of command line options, input files, build options, or maybe the code needs patching (for example, use buffered I/Os with large buffers, if it's possible); this will probably get you the best performance improvement

- you can also set sensible striping options on Lustre: I would suggest striping over all OSTs with the smallest stripe size possible (64KB); this can be accomplished with lfs, e.g.:

  $ lfs setstripe <directory> 65536 -1 -1

> Formatdb takes a file of gene or protein sequences and converts it to
> a format used by the BLAST application.

If formatdb is only used to initialise a file that will then be used most of the time by another program, then striping should probably be optimized for the latter.

Cheers,

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
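To make the "buffered I/Os with large buffers" suggestion a bit more concrete, here is a minimal sketch, again not formatdb code and with made-up file name and sizes, of how a large stdio buffer collapses thousands of tiny write() syscalls into a handful of large ones:

    /* Sketch of the buffering idea: the same small records, but pushed
     * through stdio with a 1MB user-space buffer. The kernel (and the
     * Lustre client) then sees a few large write() calls instead of
     * thousands of ~83-byte ones. Names and sizes are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (1024 * 1024)     /* 1MB stdio buffer */

    int main(void)
    {
        FILE *out = fopen("example.phr", "wb");
        if (!out)
            return 1;

        char *buf = malloc(BUF_SIZE);
        setvbuf(out, buf, _IOFBF, BUF_SIZE);   /* fully buffered, large buffer */

        char rec[83];
        memset(rec, 'x', sizeof(rec));

        for (int i = 0; i < 8488; i++)
            fwrite(rec, 1, sizeof(rec), out);  /* no syscall per record */

        fclose(out);                           /* flushes the final buffer */
        free(buf);
        return 0;
    }

With a 1MB buffer, the ~8500 records from the strace summary would reach the kernel in only a handful of write() calls, which is the kind of change that should shrink the "sys" component seen on Lustre.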
On Mon, 20 Nov 2006, Jean-Marc Saffroy wrote:

> - you can also set sensible striping options on Lustre: I would suggest
>   striping over all OSTs with the smallest stripe size possible (64KB);
>   this can be accomplished with lfs, e.g.:
>
>   $ lfs setstripe <directory> 65536 -1 -1

Actually, in your Lustre configuration script you already set striping over all OSTs. A small stripe size could still help, but having formatdb do bigger writes is certainly the best option (if it's possible).

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Modifying the program will be problematic, because we try to stick with pretty standard software and keep it up to date with new releases. I also doubt that there is anyone around in my research institute who would be able to modify the BLAST program, including me.

I'm trying the smaller stripe size now, and I was wondering what impact the stripe size of the RAID systems attached to the OSSs may have, and whether there is also an impact from the packet size of the FC layer between OSS and RAID.

Jan
On Tue, 21 Nov 2006, jan taubert (RRes-Roth) wrote:

> Modifying the program will be problematic, because we try to stick with
> pretty standard software and keep it up to date with new releases. I
> also doubt that there is anyone around in my research institute who
> would be able to modify the BLAST program, including me.

Maybe the BLAST maintainers would be open to suggestions? The write pattern in formatdb is pretty inefficient.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net