Hi,
Here's our QDR IB gluster setup:
http://piranha.structbio.vanderbilt.edu
We're still running gluster 3.0 on all of our servers and clients, along with
CentOS 5.6 kernels and OFED 1.4. To simulate a single stream I use this
nfsSpeedTest script I wrote:
http://code.google.com/p/nfsspeedtest/
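Under the hood it's basically just timed dd runs, roughly like this (a sketch, not the exact script; the test file name is only an example, and -b changes the dd block size):
dd if=/dev/zero of=/pirstripe/ddtest.13g bs=4k count=3407872  # write ~13GB (example file name)
dd if=/pirstripe/ddtest.13g of=/dev/null bs=4k  # read the same file back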
From a single QDR IB connected client to our /pirstripe directory, which is a
stripe across the gluster storage servers, this is the performance I get
(note: use a file size greater than the amount of RAM on the client and
server systems, 13GB in this case):
4k block size:
111 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 142.281 MB/s 1138.247 mbps 93.561 seconds
pir4: Read test (dd): 274.321 MB/s 2194.570 mbps 48.527 seconds
Testing block sizes from 8k to 128k with dd, the best performance was
achieved at 64k:
114 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -b 64k -y
pir4: Write test (dd): 213.344 MB/s 1706.750 mbps 62.397 seconds
pir4: Read test (dd): 955.328 MB/s 7642.620 mbps 13.934 seconds
Here are the same tests against the /pirdist directories, which are mounted
in distribute mode (each file is written to only one of the gluster servers):
105 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 182.410 MB/s 1459.281 mbps 72.978 seconds
pir4: Read test (dd): 244.379 MB/s 1955.033 mbps 54.473 seconds
106 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
pir4: Write test (dd): 204.297 MB/s 1634.375 mbps 65.160 seconds
pir4: Read test (dd): 340.427 MB/s 2723.419 mbps 39.104 seconds
For reference/control, here's the same test writing straight to the
XFS filesystem on one of the gluster storage nodes:
[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y
gluster1: Write test (dd): 398.971 MB/s 3191.770 mbps 33.366 seconds
gluster1: Read test (dd): 234.563 MB/s 1876.501 mbps 56.752 seconds
[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
gluster1: Write test (dd): 442.251 MB/s 3538.008 mbps 30.101 seconds
gluster1: Read test (dd): 219.708 MB/s 1757.660 mbps 60.590 seconds
The read test seems to scale almost linearly with the number of storage
servers (nearly 1 GB/s!): ~955 MB/s striped at 64k vs ~220 MB/s straight from
one node's XFS, so roughly 4.3x across the 5 servers. Interestingly, the
/pirdist read test at 64k block size was ~120 MB/s faster than the read test
straight from XFS; however, gluster1 may simply have been busy at the time,
and the /pirdist file could actually have been read from one of the other 4
less busy storage nodes.
Here's our storage node setup (many of these settings may not apply to v3.2):
####
volume posix-stripe
type storage/posix
option directory /export/gluster1/stripe
end-volume
volume posix-distribute
type storage/posix
option directory /export/gluster1/distribute
end-volume
volume locks
type features/locks
subvolumes posix-stripe
end-volume
volume locks-dist
type features/locks
subvolumes posix-distribute
end-volume
volume iothreads
type performance/io-threads
option thread-count 16
subvolumes locks
end-volume
volume iothreads-dist
type performance/io-threads
option thread-count 16
subvolumes locks-dist
end-volume
volume server
type protocol/server
option transport-type ib-verbs
option auth.addr.iothreads.allow 10.2.178.*
option auth.addr.iothreads-dist.allow 10.2.178.*
option auth.addr.locks.allow 10.2.178.*
option auth.addr.posix-stripe.allow 10.2.178.*
subvolumes iothreads iothreads-dist locks posix-stripe
end-volume
####
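On each storage node glusterfsd just gets started against that volfile, something like (the volfile path below is only an example, not necessarily where we keep ours):
glusterfsd -f /etc/glusterfs/server.vol  # -f/--volfile points glusterfsd at the server volfile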
Here's our stripe client setup:
####
volume client-stripe-1
type protocol/client
option transport-type ib-verbs
option remote-host gluster1
option remote-subvolume iothreads
end-volume
volume client-stripe-2
type protocol/client
option transport-type ib-verbs
option remote-host gluster2
option remote-subvolume iothreads
end-volume
volume client-stripe-3
type protocol/client
option transport-type ib-verbs
option remote-host gluster3
option remote-subvolume iothreads
end-volume
volume client-stripe-4
type protocol/client
option transport-type ib-verbs
option remote-host gluster4
option remote-subvolume iothreads
end-volume
volume client-stripe-5
type protocol/client
option transport-type ib-verbs
option remote-host gluster5
option remote-subvolume iothreads
end-volume
volume readahead-gluster1
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-stripe-1
end-volume
volume readahead-gluster2
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-stripe-2
end-volume
volume readahead-gluster3
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-stripe-3
end-volume
volume readahead-gluster4
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-stripe-4
end-volume
volume readahead-gluster5
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-stripe-5
end-volume
volume writebehind-gluster1
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster1
end-volume
volume writebehind-gluster2
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster2
end-volume
volume writebehind-gluster3
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster3
end-volume
volume writebehind-gluster4
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster4
end-volume
volume writebehind-gluster5
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster5
end-volume
volume quick-read-gluster1
type performance/quick-read
subvolumes writebehind-gluster1
end-volume
volume quick-read-gluster2
type performance/quick-read
subvolumes writebehind-gluster2
end-volume
volume quick-read-gluster3
type performance/quick-read
subvolumes writebehind-gluster3
end-volume
volume quick-read-gluster4
type performance/quick-read
subvolumes writebehind-gluster4
end-volume
volume quick-read-gluster5
type performance/quick-read
subvolumes writebehind-gluster5
end-volume
volume stat-prefetch-gluster1
type performance/stat-prefetch
#subvolumes quick-read-gluster1
subvolumes writebehind-gluster1
end-volume
volume stat-prefetch-gluster2
type performance/stat-prefetch
#subvolumes quick-read-gluster2
subvolumes writebehind-gluster2
end-volume
volume stat-prefetch-gluster3
type performance/stat-prefetch
#subvolumes quick-read-gluster3
subvolumes writebehind-gluster3
end-volume
volume stat-prefetch-gluster4
type performance/stat-prefetch
#subvolumes quick-read-gluster4
subvolumes writebehind-gluster4
end-volume
volume stat-prefetch-gluster5
type performance/stat-prefetch
#subvolumes quick-read-gluster5
subvolumes writebehind-gluster5
end-volume
volume stripe
type cluster/stripe
option block-size 2MB
#subvolumes client-stripe-1 client-stripe-2 client-stripe-3 client-stripe-4 client-stripe-5
#subvolumes readahead-gluster1 readahead-gluster2 readahead-gluster3 readahead-gluster4 readahead-gluster5
#subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
#subvolumes quick-read-gluster1 quick-read-gluster2 quick-read-gluster3 quick-read-gluster4 quick-read-gluster5
subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume
####
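On the clients the stripe volfile gets mounted in the usual way, e.g. something like (again, the volfile path is just an example):
glusterfs -f /etc/glusterfs/client-stripe.vol /pirstripe  # or: mount -t glusterfs /etc/glusterfs/client-stripe.vol /pirstripe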
Quick-read is commented out in the stripe config above because there was a
bug that caused crashes when it was enabled. This has been fixed in more
recent versions, but I haven't upgraded. Here's our client distribute setup:
####
volume client-distribute-1
type protocol/client
option transport-type ib-verbs
option remote-host gluster1
option remote-subvolume iothreads-dist
end-volume
volume client-distribute-2
type protocol/client
option transport-type ib-verbs
option remote-host gluster2
option remote-subvolume iothreads-dist
end-volume
volume client-distribute-3
type protocol/client
option transport-type ib-verbs
option remote-host gluster3
option remote-subvolume iothreads-dist
end-volume
volume client-distribute-4
type protocol/client
option transport-type ib-verbs
option remote-host gluster4
option remote-subvolume iothreads-dist
end-volume
volume client-distribute-5
type protocol/client
option transport-type ib-verbs
option remote-host gluster5
option remote-subvolume iothreads-dist
end-volume
volume readahead-gluster1
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-distribute-1
end-volume
volume readahead-gluster2
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-distribute-2
end-volume
volume readahead-gluster3
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-distribute-3
end-volume
volume readahead-gluster4
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-distribute-4
end-volume
volume readahead-gluster5
type performance/read-ahead
option page-count 4 # 2 is default
option force-atime-update off # default is off
subvolumes client-distribute-5
end-volume
volume writebehind-gluster1
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster1
end-volume
volume writebehind-gluster2
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster2
end-volume
volume writebehind-gluster3
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster3
end-volume
volume writebehind-gluster4
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster4
end-volume
volume writebehind-gluster5
type performance/write-behind
option flush-behind on
subvolumes readahead-gluster5
end-volume
volume quick-read-gluster1
type performance/quick-read
subvolumes writebehind-gluster1
end-volume
volume quick-read-gluster2
type performance/quick-read
subvolumes writebehind-gluster2
end-volume
volume quick-read-gluster3
type performance/quick-read
subvolumes writebehind-gluster3
end-volume
volume quick-read-gluster4
type performance/quick-read
subvolumes writebehind-gluster4
end-volume
volume quick-read-gluster5
type performance/quick-read
subvolumes writebehind-gluster5
end-volume
volume stat-prefetch-gluster1
type performance/stat-prefetch
subvolumes quick-read-gluster1
end-volume
volume stat-prefetch-gluster2
type performance/stat-prefetch
subvolumes quick-read-gluster2
end-volume
volume stat-prefetch-gluster3
type performance/stat-prefetch
subvolumes quick-read-gluster3
end-volume
volume stat-prefetch-gluster4
type performance/stat-prefetch
subvolumes quick-read-gluster4
end-volume
volume stat-prefetch-gluster5
type performance/stat-prefetch
subvolumes quick-read-gluster5
end-volume
volume distribute
type cluster/distribute
#option block-size 2MB
#subvolumes client-distribute-1 client-distribute-2 client-distribute-3 client-distribute-4 client-distribute-5
option min-free-disk 1%
#subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume
####
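The distribute volume gets mounted the same way, just with its own volfile (path again only an example):
glusterfs -f /etc/glusterfs/client-distribute.vol /pirdist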
I don't know why my writes are so slow compared to reads. Let me know if
you're able to get better write speeds with the newer version of gluster and
any of the configurations I've posted (where they still apply); it might
compel me to upgrade.
HTH,
Sabuj Pattanayek
> For some background, our compute cluster has 64 compute nodes. The gluster
> storage pool has 10 Dell PowerEdge R515 servers, each with 12 x 2 TB disks.
> We have another 16 Dell PowerEdge R515s used as Lustre storage servers. The
> compute and storage nodes are all connected via QDR Infiniband. Both Gluster
> and Lustre are set to use RDMA over Infiniband. We are using OFED version
> 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5 on both the compute and storage
> nodes.
>
> Oddly, it seems like there's some sort of bottleneck on the client side --
> for example, we're only seeing about 50 MB/s write throughput from a single
> compute node when writing a 10GB file. But, if we run multiple simultaneous
> writes from multiple compute nodes to the same Gluster volume, we get 50
> MB/s from each compute node. However, running multiple writes from the same
> compute node does not increase throughput. The compute nodes have 48 cores
> and 128 GB RAM, so I don't think the issue is with the compute node
> hardware.
>
> With Lustre, on the same hardware, with the same version of OFED, we're
> seeing write throughput on that same 10 GB file as follows: 476 MB/s single
> stream write from a single compute node and aggregate performance of more
> like 2.4 GB/s if we run simultaneous writes. That leads me to believe that
> we don't have a problem with RDMA, otherwise Lustre, which is also using
> RDMA, should be similarly affected.
>
> We have tried both xfs and ext4 for the backend file system on the Gluster
> storage nodes (we're currently using ext4). We went with distributed (not
> distributed striped) for the Gluster volume -- the thought was that if there
> was a catastrophic failure of one of the storage nodes, we'd only lose the
> data on that node; presumably with distributed striped you'd lose any data
> striped across that volume, unless I have misinterpreted the documentation.
>
> So ... what's expected/normal throughput for Gluster over QDR IB to a
> relatively large storage pool (10 servers / 120 disks)? Does anyone have
> suggested tuning tips for improving performance?
>
> Thanks!
>
> John
>
> --
>
> ________________________________________________________
>
> John Lalande
> University of Wisconsin-Madison
> Space Science & Engineering Center
> 1225 W. Dayton Street, Room 439, Madison, WI 53706
> 608-263-2268 / john.lalande at ssec.wisc.edu