Hi,

I've created a small Lustre testbed which I'm using for experiments. n32 holds the MGS and one OST. I get 39 MB/s when writing to the disk from n32. However, if I run the same dd test from n31 on a mounted OST over RDMA using o2ib, I get 12.5 MB/s! Why am I seeing this performance loss when my IB interconnect can deliver over 650 MB/s? I'm currently using HCA firmware 3.2 for the 23108, but I'm not sure this is the issue. The socknal thread on n32 uses only 4% of CPU while n31 is writing. How can I troubleshoot this performance issue?

ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca 0000:04:00.0: HCA FW version 3.2.0 is old (3.4.0 is current).
ib_mthca 0000:04:00.0: If you have problems, try updating your HCA FW.

n32:

/dev/sdb1                  899M   41M  808M   5% /mnt/lustre/mgs
/dev/sdb2                  9.9G  3.1G  6.4G  33% /mnt/lustre/sdb2
/dev/sdb3                   30G  2.1G   27G   8% /mnt/lustre/sdb3
/dev/sdb4                   27G  1.4G   25G   6% /mnt/lustre/sdb4
161.74.83.48@o2ib:/testfs   67G  6.5G   57G  11% /mnt/lustre/ost

n32:/mnt/lustre/ost # dd if=/dev/zero of=titi bs=1k count=500k
512000+0 records in
512000+0 records out
524288000 bytes (524 MB) copied, 13.406 seconds, 39.1 MB/s

top - 14:20:19 up 1 day,  7:15,  2 users,  load average: 0.00, 0.00, 0.00
Tasks: 272 total,   1 running, 271 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.3%sy,  0.0%ni, 81.1%id, 12.5%wa,  1.2%hi,  3.0%si,  0.0%st
Mem:    515144k total,   508984k used,     6160k free,   349900k buffers
Swap:  1052216k total,      148k used,  1052068k free,    51280k cached

  PID USER  PR  NI  VIRT   RES  SHR S %CPU %MEM    TIME+  COMMAND
 4003 root  15   0     0     0    0 S    5  0.0  7:19.17  socknal_sd00
 4358 root  15   0     0     0    0 S    0  0.0  0:06.08  ll_ost_io_04
 4364 root  15   0     0     0    0 S    0  0.0  0:06.28  ll_ost_io_10
 7412 root  16   0  2320  1136  768 R    0  0.2  0:00.21  top
    1 root  16   0   716   284  244 S    0  0.1  0:00.87  init

n31:/mnt/lustre/ost # df -h -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/sda6                   33G  4.9G   29G  15% /
udev                       252M  136K  252M   1% /dev
161.74.83.48@o2ib:/testfs   67G  6.5G   57G  11% /mnt/lustre/ost

n31:/mnt/lustre/ost # dd if=/dev/zero of=titi bs=1k count=500k
512000+0 records in
512000+0 records out
524288000 bytes (524 MB) copied, 41.9394 seconds, 12.5 MB/s

Thierry.

----------------------------------------
Dr Thierry DELAITRE
Systems and Services Manager, CSCS
University of Westminster
115 New Cavendish Street, London W1W 6UW
Tel: 020 7911 5000 ext: 3586
Fax: 020 7911 5089
Mobile short dial code 1788
http://www.cscs.wmin.ac.uk/~delaitt
----------------------------------------
This e-mail and its attachments are intended for the above named only and may be confidential. If they have come to you in error you must not copy or show them to anyone, nor should you take any action based on them, other than to notify the error by replying to the sender.
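Note for readers: the rates dd reports can be rechecked from the byte and second counts it prints, which makes the gap between the two nodes easy to quantify. This is only a minimal sketch, assuming a POSIX shell with awk available; the function name mbps is made up for this illustration, and MB means 10^6 bytes, as in dd's own output.

```shell
#!/bin/sh
# Hypothetical helper: throughput in MB/s (1 MB = 10^6 bytes, as dd reports it).
mbps() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

mbps 524288000 13.406     # local write on n32
mbps 524288000 41.9394    # write from n31 over the network
```

The two calls print 39.1 and 12.5, matching dd, so the run from n31 was roughly three times slower than the run on n32.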
I'm now getting the same performance on a Lustre client as on the Lustre server. I removed the tcp interface from the modprobe configuration and left only o2ib, because for some reason Lustre was using the tcp interface for communication.

n31:/mnt/lustre/ost # dd if=/dev/zero of=titi bs=1k count=500k
512000+0 records in
512000+0 records out
524288000 bytes (524 MB) copied, 12.6019 seconds, 41.6 MB/s

Thierry.

On Thu, 28 Sep 2006, Thierry Delaitre wrote:

> I've created a small lustre testbed which i'm using to do experiments. n32
> holds the mgs and one ost. I get 39MB/s when writing to the disk from n32.
> However, if i do the same experiments using dd on a mounted ost over rdma
> using o2ib i get 12.5MB/s!!! Why am i experiencing a performance loss when
> my ib interconnect can deliver over 650MB/s ? I'm currently using HCA
> firmware 3.2 for the 23108 but i'm not sure this is the issue. The socknal
> on n32 when n31 is writing is only using 4% of CPU. how can i troubleshoot
> this performance issue ?
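Note for readers: the fix described in this message comes down to which networks the lnet module is told to use. The top output in the original post already hinted at the cause, since socknal_sd00 belongs to the socket (TCP) network driver rather than to o2ib. Below is a small sketch for checking which networks a modprobe configuration line names. The helper name lnet_networks and both sample lines are hypothetical, since the actual modprobe file was not posted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the networks= value of "options lnet" lines read from stdin.
lnet_networks() {
    sed -n 's/^[[:space:]]*options[[:space:]][[:space:]]*lnet[[:space:]].*networks=\([^[:space:]]*\).*/\1/p'
}

# Sample lines (assumed for illustration, not taken from the thread):
printf 'options lnet networks=tcp0,o2ib\n' | lnet_networks   # both networks listed
printf 'options lnet networks=o2ib\n'      | lnet_networks   # o2ib only
```

On a running node, lctl list_nids shows the NIDs that are actually configured, which is a quick way to confirm that no @tcp NID is left.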
Thierry,

How did you get around your problems with o2iblnd/OFED symbol versions?

Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: eeb@bartonsoftware.com|
---------------------------------------------------

> -----Original Message-----
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Thierry Delaitre
> Sent: 29 September 2006 10:09 AM
> To: lustre-discuss@clusterfs.com
> Subject: [Lustre-discuss] Re: performance issues
>
> I'm now getting same perf on a lustre client than on the lustre server. I
> removed the tcp interface from modprobe and left o2ib only as for some
> reason lustre was using the tcp interface for communication.
>
> n31:/mnt/lustre/ost # dd if=/dev/zero of=titi bs=1k count=500k
> 512000+0 records in
> 512000+0 records out
> 524288000 bytes (524 MB) copied, 12.6019 seconds, 41.6 MB/s
>
> Thierry.
Eric,

On Fri, 29 Sep 2006, Eric Barton wrote:
> How did you get around your problems with o2iblnd/OFED symbol versions?

Three things:

- I think Lustre was using the include files of the IB module that comes natively with the Linux kernel. I removed /usr/src/linux-2.6.16.21-0.8/include/rdma and created a symlink from /usr/src/linux-2.6.16.21-0.8/drivers/infiniband to /usr/local/ofed/src/openib/drivers/infiniband.

- In addition to the issue above, I had to set mod_version to 'n' in the kernel as a workaround. Someone else is also experiencing this:
  https://mail.clusterfs.com/pipermail/lustre-discuss/2006-July/001834.html

- You have to compile Lustre against /usr/local/ofed/src/openib and not /usr/local/ofed/src/openib-1.1:

  mkdir -p /usr/local/ofed/src/openib/drivers/infiniband
  ln -s /usr/local/ofed/src/openib/include /usr/local/ofed/src/openib/drivers/infiniband
  ./configure --with-o2ib=/usr/local/ofed/src/openib

Cheers,
Thierry.
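Note for readers: before running configure it can help to confirm that the tree passed to --with-o2ib has the layout that the mkdir and ln commands in this message create. This is a rough sketch only. The function name check_o2ib_tree is hypothetical, and which directories Lustre's configure really inspects is an assumption based on this message, not on the configure script itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR has include/ and drivers/infiniband/include.
check_o2ib_tree() {
    dir=$1
    [ -d "$dir/include" ] && [ -e "$dir/drivers/infiniband/include" ]
}

# Demo on a throwaway tree that mirrors the mkdir/ln steps from the message.
tmp=$(mktemp -d)
mkdir -p "$tmp/include" "$tmp/drivers/infiniband"
ln -s "$tmp/include" "$tmp/drivers/infiniband"   # creates drivers/infiniband/include
if check_o2ib_tree "$tmp"; then echo "layout ok"; else echo "layout missing"; fi
rm -rf "$tmp"
```

Against the real tree the call would be check_o2ib_tree /usr/local/ofed/src/openib, using the path from the message.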