--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

Well, I've set everything up using the 64-bit kernel; for now I have
~4 TB of usable space with a single Lustre OST volume. I just did some
speed tests between a client and the filesystem server, over a dedicated
Gbit Ethernet connection. I compared uploading via the Lustre-mounted
share with uploading to the same share mounted as a loopback Lustre
client on the filesystem server and re-exported via NFS.

The results are quite sad: writes to the remote Lustre fs directly are
dreadfully slow, while writes to the Lustre fs re-exported via NFS are
at least 10 times faster. The client machine is a Xeon 2.4 GHz with
4 GB RAM and the server machine is a Xeon 3.0 GHz with 8 GB RAM. I
reviewed the tuning chapter of the Lustre manual and tuned the rx ring
of the Ethernet interface with ethtool.

The Lustre volumes (MGS, MDT, OST) are set up on UpToDate (synchronized)
DRBD resources; synchronization already finished, over a dedicated
1 Gbit link, not the interface used to communicate with Lustre clients.
I'd blame DRBD for this - some cost is expected with DRBD - but the
NFS-re-exported, locally-mounted Lustre volume obviously goes through
the DRBD stack too! The DRBD resource is set up as the backend storage
device for Lustre, so it's not actually possible to read or write
anything to/from Lustre while skipping the DRBD stack. The machines are
load-free.

It seems that with a client-initiated write, the path
lustre client => lustre server => drbd resource "X" is dramatically
slower than
nfs client => nfs server => loopback lustre server => drbd resource "X".
And this is definitely not expected. Below are example transfer rates.
Any ideas? Is this, for example, some difference between how NFS and
Lustre behave with a gigabit switch in the middle?

aleft:~# free -m
             total       used       free     shared    buffers     cached
Mem:          7987       3861       4126          0        102       3475
-/+ buffers/cache:         282       7705
Swap:         1906          0       1906
aleft:~# logout
Connection to master closed.
b02:~# free -m
             total       used       free     shared    buffers     cached
Mem:          4054       3908        145          0         43       1813
-/+ buffers/cache:        2051       2002
Swap:         7812          0       7812
b02:~#
b02:~# ssh root@master
[..]
aleft:~# mount -t lustre
/dev/drbd0 on /mnt/mgs type lustre (rw,noauto)
/dev/drbd1 on /mnt/mdt type lustre (rw,noauto,_netdev)
/dev/drbd2 on /mnt/ost01 type lustre (rw,noauto,_netdev)
master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
aleft:~# logout
b02:~# mount -t lustre
master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
b02:~# mount -t nfs |grep master
master:/mnt/lfs00 on /mnt/nfs00 type nfs (rw,addr=192.168.0.100)
b02:~#
Connection to master closed.
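For reference, the rx ring tuning mentioned above is usually done along
these lines; the interface name eth0 and the ring size 4096 are
assumptions here, since the actual values used were not posted:

# inspect current and maximum rx/tx ring sizes, then raise the rx ring
ethtool -g eth0
ethtool -G eth0 rx 4096   # stay within the "pre-set maximum" shown by -g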
b02:~# ./100mb.sh
lfs00-send
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 22.3427 s, 4.7 MB/s

real    0m22.345s
user    0m0.100s
sys     0m3.760s

lfs00-get
time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.987265 s, 106 MB/s

real    0m0.989s
user    0m0.040s
sys     0m0.880s

b02:~# ./100mb-nfs.sh
nfs00-send
time dd if=/dev/zero of=/mnt/nfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 1.05942 s, 99.0 MB/s

real    0m1.061s
user    0m0.028s
sys     0m0.252s

nfs00-get
time dd of=testfile-b02 if=/mnt/nfs00/testfile-b02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.576351 s, 182 MB/s

real    0m0.578s
user    0m0.016s
sys     0m0.556s
b02:~#
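The 100mb.sh and 100mb-nfs.sh scripts themselves were not posted; judging
from the output above, each presumably amounts to something like the
sketch below (the labels and the /mnt/lfs00 path come from the transcript,
everything else is an assumption):

#!/bin/bash
# 100mb.sh (reconstruction): write 100 MB to the Lustre mount in 1 KB
# blocks, then read it back into the local working directory
echo lfs00-send
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
echo lfs00-get
time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400

100mb-nfs.sh would presumably be the same with /mnt/nfs00 and the nfs00-*
labels.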
To be exact, I got similar rates with a hundred files of one megabyte
each, so it's not really about the size of the files. The Lustre client
rate is 100+ MB/s download and 4.9 MB/s upload, while nfs client =>
remote nfs server => local lustre client => local lustre server, over
the same Ethernet interface, is 180+ MB/s download and ~100 MB/s upload.

In both cases, on upload to the remote server the data are read from the
local /dev/zero and written to the DRBD volume, and on download from the
remote server the data are read from a file and written to a local ext3
volume.

...

Regards,
DT

--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

On Sun, 8 Nov 2009, Piotr Wadas wrote:
>
>
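The exact commands for the hundred-file test are not shown; a minimal
sketch that would reproduce it might be (the file name pattern and
target directory are assumptions):

# write a hundred 1 MB files to the Lustre mount
time sh -c 'for i in $(seq 1 100); do
    dd if=/dev/zero of=/mnt/lfs00/smallfile-$i bs=1024 count=1024 2>/dev/null
done'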
One more thing - the server is Lustre 1.8.1.1 and the clients are Lustre
1.8.1; I haven't upgraded the kernels on the clients yet.

DT

--
Linux aleft 2.6.27.29-0.1_lustre.1.8.1.1-default #1 SMP
drbd 8.3.5 (api:88/proto:86-91)
pacemaker 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
Lustre 1.8.1.1-20091009080716-PRISTINE-2.6.27.29-0.1_lustre.1.8.1.1-default

On Sun, 8 Nov 2009, Piotr Wadas wrote:
> To be exact, I got similar rates with a hundred files of one megabyte
> each, so it's not really about the size of the files. The Lustre client
> rate is 100+ MB/s download and 4.9 MB/s upload, while nfs client =>
> remote nfs server => local lustre client => local lustre server, over
> the same Ethernet interface, is 180+ MB/s download and ~100 MB/s upload.
>
> In both cases, on upload to the remote server the data are read from the
> local /dev/zero and written to the DRBD volume, and on download from the
> remote server the data are read from a file and written to a local ext3
> volume.
>
> ...
>
> Regards,
> DT
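If it helps to confirm the version skew, the kernel and Lustre versions
on each node can be read with the standard tooling; this is a sketch
assuming the Lustre 1.8 utilities are installed on both server and
clients:

# run on each node
uname -r                        # running kernel
cat /proc/fs/lustre/version     # Lustre version of the loaded modules
lctl get_param version          # equivalent, via lctl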
On Sun, 2009-11-08 at 21:52 +0100, Piotr Wadas wrote:
> I just did some speed tests between a client and the filesystem server,
> over a dedicated Gbit Ethernet connection. I compared uploading via the
> Lustre-mounted share with uploading to the same share mounted as a
> loopback Lustre client on the filesystem server and re-exported via NFS.

I'm not sure I understand the configuration of your "loopback" setup.
Could you provide more details as to what exactly you mean?

> nfs client => nfs server => loopback lustre server => drbd resource "X"

I don't understand why you need to introduce NFS in all of this.

> aleft:~# mount -t lustre
> /dev/drbd0 on /mnt/mgs type lustre (rw,noauto)
> /dev/drbd1 on /mnt/mdt type lustre (rw,noauto,_netdev)
> /dev/drbd2 on /mnt/ost01 type lustre (rw,noauto,_netdev)
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)

So "master" is your Lustre server with both the MDS and the OST on it?
This will be a sub-optimal configuration due to the seeking being done
between the MDT and the OST. If you really do have only one machine
available to you, then plain old NFS will likely perform better for you.
Lustre doesn't really begin to shine until you can throw more resources
at it; in that situation Lustre starts to outperform NFS by efficiently
utilizing the many machines you give it.

> b02:~# mount -t lustre
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
> b02:~# mount -t nfs |grep master
> master:/mnt/lfs00 on /mnt/nfs00 type nfs (rw,addr=192.168.0.100)

Ahhh. Now your NFS scenario may be clearer: on the "master" server you
have mounted a Lustre client and exported that via NFS?

> time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
                                                  ^^^^^^^ ^^^^^^^^^^^^

Try increasing the block size (and reducing the count if you want to send
the same amount of data). Try a block size of 1M (and a count of 100 to
keep the dataset the same size, if you wish).

> lfs00-get
> time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.987265 s, 106 MB/s
>
> real    0m0.989s
> user    0m0.040s
> sys     0m0.880s

This result is likely demonstrating readahead.

> b02:~# ./100mb-nfs.sh
>
> nfs00-send
> time dd if=/dev/zero of=/mnt/nfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 1.05942 s, 99.0 MB/s

Probably something in the NFS stack is coalescing the small writes into
larger writes before sending them over the wire to the server.

> nfs00-get
> time dd of=testfile-b02 if=/mnt/nfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.576351 s, 182 MB/s

What is /dev/drbd2? Is it a pair of (individual) disks or an array of
some sort? If individual disks, you have to agree that 182 MB/s is
unrealistic, yes? Likely you are measuring the speed of caching here.
Try increasing your dataset size so that it exceeds the ability of the
cache to help out. Probably a few tens of GB will do it.

b.
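A re-run along the lines suggested above might look like the following;
the 20 GB size is only one example of "a few tens of GB", and dropping
the page cache before the read-back is an extra step not mentioned in
the reply:

# same 100 MB dataset, but in 1 MB blocks instead of 1 KB blocks
time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1M count=100

# a dataset well beyond the RAM of either machine (8 GB server, 4 GB client)
time dd if=/dev/zero of=/mnt/lfs00/testfile-big bs=1M count=20480

# optionally drop the client page cache before timing the read back
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/lfs00/testfile-big of=/dev/null bs=1M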