--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

Well, I've set everything up using the 64-bit kernel; for now I have
~4 TB of usable space with a single Lustre OST volume. I just did some
speed tests between a client and the filesystem server, over a dedicated
Gbit Ethernet connection. I compared uploading via the Lustre-mounted
share with uploading to the same share mounted as a loopback Lustre
client on the filesystem server and re-exported via NFS.

The results are quite sad: writes to the remote Lustre fs directly are
dreadfully slow, while writes to the Lustre fs re-exported via NFS are
at least 10 times faster. The client machine is a Xeon 2.4 GHz with
4 GB RAM and the server machine is a Xeon 3.0 GHz with 8 GB RAM. I
reviewed the tuning chapter of the Lustre manual and tuned the rx ring
of the Ethernet interface with ethtool.

The Lustre volumes (MGS, MDT, OST) are set up on UpToDate (synchronized)
DRBD resources; synchronization already finished, over a dedicated
1 Gbit link, not the interface used to communicate with Lustre clients.
I'd blame DRBD for this - some cost is expected with DRBD - but the
NFS-re-exported, locally-mounted Lustre volume obviously goes through
the DRBD stack too! The DRBD resource is set up as the backend storage
device for Lustre, so it's not actually possible to read or write
anything to/from Lustre while skipping the DRBD stack. The machines are
load-free.

It seems that with a client-initiated write, the path
lustre client => lustre server => drbd resource "X" is dramatically
slower than
nfs client => nfs server => loopback lustre server => drbd resource "X".
And this is definitely not expected. Below are example transfer rates.
Any ideas? Is this, for example, some difference between how NFS and
Lustre behave with a gigabit switch in the middle?

aleft:~# free -m
             total       used       free     shared    buffers     cached
Mem:          7987       3861       4126          0        102       3475
-/+ buffers/cache:         282       7705
Swap:         1906          0       1906
aleft:~# logout
Connection to master closed.
b02:~# free -m
             total       used       free     shared    buffers     cached
Mem:          4054       3908        145          0         43       1813
-/+ buffers/cache:        2051       2002
Swap:         7812          0       7812
b02:~#
b02:~# ssh root@master
[..]
aleft:~# mount -t lustre
/dev/drbd0 on /mnt/mgs type lustre (rw,noauto)
/dev/drbd1 on /mnt/mdt type lustre (rw,noauto,_netdev)
/dev/drbd2 on /mnt/ost01 type lustre (rw,noauto,_netdev)
master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
aleft:~# logout
b02:~# mount -t lustre
master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
b02:~# mount -t nfs |grep master
master:/mnt/lfs00 on /mnt/nfs00 type nfs (rw,addr=192.168.0.100)
b02:~#
Connection to master closed.
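For reference, the rx ring tuning mentioned above is usually done along
these lines; the interface name eth0 and the ring size 4096 are
assumptions here, since the actual values used were not posted:

# inspect current and maximum rx/tx ring sizes, then raise the rx ring
ethtool -g eth0
ethtool -G eth0 rx 4096   # stay within the "pre-set maximum" shown by -g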
b02:~# ./100mb.sh
lfs00-send
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 22.3427 s, 4.7 MB/s

real    0m22.345s
user    0m0.100s
sys     0m3.760s

lfs00-get
time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.987265 s, 106 MB/s

real    0m0.989s
user    0m0.040s
sys     0m0.880s

b02:~# ./100mb-nfs.sh
nfs00-send
time dd if=/dev/zero of=/mnt/nfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 1.05942 s, 99.0 MB/s

real    0m1.061s
user    0m0.028s
sys     0m0.252s

nfs00-get
time dd of=testfile-b02 if=/mnt/nfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.576351 s, 182 MB/s

real    0m0.578s
user    0m0.016s
sys     0m0.556s
b02:~#
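The 100mb.sh and 100mb-nfs.sh scripts themselves were not posted; judging
from the output above, each presumably amounts to something like the
sketch below (the labels and the /mnt/lfs00 path come from the transcript,
everything else is an assumption):

#!/bin/bash
# 100mb.sh (reconstruction): write 100 MB to the Lustre mount in 1 KB
# blocks, then read it back into the local working directory
echo lfs00-send
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
echo lfs00-get
time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400

100mb-nfs.sh would presumably be the same with /mnt/nfs00 and the nfs00-*
labels.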
To be exact, I got similar rates with a hundred files of one megabyte
each, so it's not really about the size of the files. The Lustre client
rate is 100+ MB/s download and 4.9 MB/s upload, while nfs client =>
remote nfs server => local lustre client => local lustre server, over
the same Ethernet interface, is 180+ MB/s download and ~100 MB/s upload.

In both cases, on upload to the remote server the data are read from the
local /dev/zero and written to the DRBD volume, and on download from the
remote server the data are read from a file and written to a local ext3
volume.

...

Regards,
DT

--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

On Sun, 8 Nov 2009, Piotr Wadas wrote:
>
>
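The exact commands for the hundred-file test are not shown; a minimal
sketch that would reproduce it might be (the file name pattern and
target directory are assumptions):

# write a hundred 1 MB files to the Lustre mount
time sh -c 'for i in $(seq 1 100); do
    dd if=/dev/zero of=/mnt/lfs00/smallfile-$i bs=1024 count=1024 2>/dev/null
done'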
One more thing - the server is Lustre 1.8.1.1 and the clients are Lustre
1.8.1; I haven't upgraded the kernels on the clients yet.

DT

--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

On Sun, 8 Nov 2009, Piotr Wadas wrote:
> To be exact, I got similar rates with a hundred files of one megabyte
> each, so it's not really about the size of the files. The Lustre client
> rate is 100+ MB/s download and 4.9 MB/s upload, while nfs client =>
> remote nfs server => local lustre client => local lustre server, over
> the same Ethernet interface, is 180+ MB/s download and ~100 MB/s upload.
>
> In both cases, on upload to the remote server the data are read from the
> local /dev/zero and written to the DRBD volume, and on download from the
> remote server the data are read from a file and written to a local ext3
> volume.
>
> ...
>
> Regards,
> DT
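If it helps to confirm the version skew, the kernel and Lustre versions
on each node can be read with the standard tooling; this is a sketch
assuming the Lustre 1.8 utilities are installed on both server and
clients:

# run on each node
uname -r                        # running kernel
cat /proc/fs/lustre/version     # Lustre version of the loaded modules
lctl get_param version          # equivalent, via lctl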
On Sun, 2009-11-08 at 21:52 +0100, Piotr Wadas wrote:
> I just did some speed tests between a client and the filesystem server,
> over a dedicated Gbit Ethernet connection. I compared uploading via the
> Lustre-mounted share with uploading to the same share mounted as a
> loopback Lustre client on the filesystem server and re-exported via NFS.

I'm not sure I understand the configuration of your "loopback" setup.
Could you provide more details as to what exactly you mean?

> nfs client => nfs server => loopback lustre server => drbd resource "X"

I don't understand why you need to introduce NFS in all of this.

> aleft:~# mount -t lustre
> /dev/drbd0 on /mnt/mgs type lustre (rw,noauto)
> /dev/drbd1 on /mnt/mdt type lustre (rw,noauto,_netdev)
> /dev/drbd2 on /mnt/ost01 type lustre (rw,noauto,_netdev)
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)

So "master" is your Lustre server with both the MDS and the OST on it?
This will be a sub-optimal configuration due to the seeking being done
between the MDT and the OST. If you really do have only one machine
available to you, then plain old NFS will likely perform better for you.
Lustre doesn't really begin to shine until you can throw more resources
at it; in that situation Lustre starts to outperform NFS by efficiently
utilizing the many machines you give it.

> b02:~# mount -t lustre
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
> b02:~# mount -t nfs |grep master
> master:/mnt/lfs00 on /mnt/nfs00 type nfs (rw,addr=192.168.0.100)

Ahhh. Now your NFS scenario may be clearer: on the "master" server you
have mounted a Lustre client and exported that via NFS?

> time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
                                                  ^^^^^^^ ^^^^^^^^^^^^

Try increasing the block size (and reducing the count if you want to send
the same amount of data). Try a block size of 1M (and a count of 100 to
keep the dataset the same size, if you wish).

> lfs00-get
> time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.987265 s, 106 MB/s
>
> real    0m0.989s
> user    0m0.040s
> sys     0m0.880s

This result is likely demonstrating readahead.

> b02:~# ./100mb-nfs.sh
>
> nfs00-send
> time dd if=/dev/zero of=/mnt/nfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 1.05942 s, 99.0 MB/s

Probably something in the NFS stack is coalescing the small writes into
larger writes before sending them over the wire to the server.

> nfs00-get
> time dd of=testfile-b02 if=/mnt/nfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.576351 s, 182 MB/s

What is /dev/drbd2? Is it a pair of (individual) disks or an array of
some sort? If individual disks, you have to agree that 182 MB/s is
unrealistic, yes? Likely you are measuring the speed of caching here.
Try increasing your dataset size so that it exceeds the ability of the
cache to help out. Probably a few tens of GB will do it.

b.
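A re-run along the lines suggested above might look like the following;
the 20 GB size is only one example of "a few tens of GB", and dropping
the page cache before the read-back is an extra step not mentioned in
the reply:

# same 100 MB dataset, but in 1 MB blocks instead of 1 KB blocks
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1M count=100

# a dataset well beyond the RAM of either machine (8 GB server, 4 GB client)
time dd if=/dev/zero of=/mnt/lfs00/testfile-big bs=1M count=20480

# optionally drop the client page cache before timing the read back
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/lfs00/testfile-big of=/dev/null bs=1M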