Carsten Aulbert
2012-Feb-07 08:59 UTC
[Gluster-users] Recommendations for busy static web server replacement
Hi all,

After being a silent reader for some time, and after not being very successful in getting good performance out of our test set-up, I'm finally coming to the list with questions.

Right now we are operating a web server serving out 4MB files for a distributed computing project. Data is requested from all over the world at a rate of about 650k to 800k downloads a day. Each data file is usually only ever read 2-3 times and is deleted again after some time. Typical data rates are therefore about 30MB/s, day in and day out, with a constant influx of new data at about half that rate. File deletion is an ever ongoing process as well.

Currently this machine is running on a hardware RAID6 with a 10TB xfs volume serving the needs, but even after optimizing the I/O scheduler we are hitting a limit here. Obviously, going with RAID10 should be faster, but we are now aiming for a more scalable solution, e.g. glusterfs. Our idea is to have a web server at the front which processes the requests and fetches the data from a storage pool in the background. If need be, we could then add more storage bricks for better scalability. Our current idea is a distributed/replicated set-up with 2n servers.

For testing, I have the following systems available: up to 10 servers with 12 SATA data disks each and md software raid (the OS is on a SATA DoM), plus one "web server". All of these are connected by 10GbE. Each server has 16 or 24GB of RAM (currently limited to 500MB, as I don't want to test caching yet) and multiple cores (at least 4 cores @ 2GHz, plus HT). For stress testing I can use a large number of computers which e.g. use curl for downloading the files to /dev/null (a sketch of that client loop is further down).

I did a few tests, mostly to get used to glusterfs first, but so far performance has not been very good:

(1) Two servers with raid0 over all 12 disks, each serving as a single storage brick in a simple replicated set-up. The "web server" node mounted this via FUSE and I created several thousand files at a rate of about 115MB/s. Then nginx served files to 50 clients. Each client downloaded 100 4MB files plus 100 small files holding the md5sums of the other files, for validation. On average each client took about 7.25 minutes, and on the web server I only saw ~46MB/s throughput.

(2) The same two servers, now each exporting every disk on its own, i.e.

    gluster volume create test-volume replica 2 transport tcp \
        $(for i in b c d e f g h i j k l m ; do
            for n in 1 2; do echo -n "gluster0$n:/data-$i "; done
          done)

As expected, the overhead here is larger: initial file creation started slowly at 45MB/s and peaked around 105MB/s, and the 50 clients saw a total bandwidth of about 41MB/s.

(3) I ran other tests across all 10 backend bricks, but never got beyond ~80MB/s when using 4-disk raid0 bricks across the 10 servers.

All tests were run with glusterfs 3.2.5 on Debian Squeeze, md software raid and the xfs file system, with "default" settings for all of these except the deadline I/O scheduler.

Now the big question: for our next generation production system, how can we get (much) better performance? As I see neither the disks at 100% utilization nor the network loaded at all, and the bricks are pretty idle, I currently suspect that glusterfs needs some serious tuning here. On the web server I do see a single thread taking about 80% user CPU, which never goes beyond that, plus about 10-15% system CPU usage. But so far, blindly poking around, I have yet to hit the right setting.
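For reference, these are the kinds of knobs I have been turning so far, without much visible effect. The values below are more or less guesses, so please take them as an illustration of what I tried rather than as a recommendation:

    gluster volume set test-volume performance.io-thread-count 32
    gluster volume set test-volume performance.cache-size 256MB
    gluster volume set test-volume performance.write-behind-window-size 4MB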
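And for completeness, the client side of the stress test is essentially just the following loop (host name and file list here are of course placeholders):

    # run on each of the ~50 test clients
    while read f; do
        curl -s -o /dev/null "http://webserver.example.org/data/$f"
    done < filelist.txt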
Ideally, I'd like to have a set-up where multiple relatively cheap computers, with say 4 disks each run as raid0, raid10 or no raid at all, export their storage via glusterfs to our web server. Gluster's replication would serve as a kind of fail-safe net, and data redistribution would help when we add more similar machines later on to counter increased usage.

Thanks a lot in advance for at least reaching the end of my email, any help appreciated.

Cheers

Carsten
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6
CaCert Assurer | Get free certificates from http://www.cacert.org/
Brian Candler
2012-Feb-07 13:50 UTC
[Gluster-users] Recommendations for busy static web server replacement
On Tue, Feb 07, 2012 at 09:59:44AM +0100, Carsten Aulbert wrote:
> (1) Two servers with raid0 over all 12 disks, each serving as a single
> storage brick in a simple replicated set-up.

I am doing some similar tests at the moment.

1. What's your stripe size? If your files are typically 4MB, then using a 4MB or larger stripe size will mean that most requests are serviced from a single disk. This will give higher latency for a single client but leave lots of spindles free for other concurrent clients, maximising your total throughput. If you have a stripe size of 1MB, then each file read will need to seek on 4 disks. This gives you longer rotational latency (on average close to a full rotation instead of half a rotation), but a quarter of the transfer time. That might be a good trade-off for single clients, but could reduce your total throughput with many concurrent clients. Anything smaller is likely to suck.

2. Have you tried RAID10 in "far" mode? e.g.

    mdadm --create /dev/md/raid10 -n 12 -c 4096 -l raid10 -p f2 -b internal /dev/sd{h..s}

The advantage here is that all the data can be read off the first half of each disk, which means shorter seek times and also higher transfer rates (the MB/sec at the outside of the disk is about twice the MB/sec at the centre of the disk). The downside is more seeking for writes, which may or may not pay off with your 3:1 read:write ratio. As long as there is write-behind going on, I think it may.

Since each node then has RAID10 disk protection, you could use a simple distributed setup on top of it (at the cost of losing the ability to take a whole storage node out of service). Or you could have twice as many disks.

3. When you mount your XFS filesystems, do you provide the 'inode64' mount option? This can be critical for filesystems >1TB to get decent performance, as I found out the hard way:

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F

"noatime" and "nodiratime" can be helpful too (there is an example mount entry below).

4. Have you tuned read_ahead_kb and max_sectors_kb? On my system the defaults are 128 and 512 respectively.

    for i in /sys/block/sd*/bdi/read_ahead_kb; do echo 1024 >"$i"; done
    for i in /sys/block/sd*/queue/max_sectors_kb; do echo 1024 >"$i"; done

5. Have you tried apache or apache2 instead of nginx? Have you done any testing directly on the mount point, not using a web server? (Something like the loop sketched below would do.)

> Ideally, I'd like to have a set-up where multiple relatively cheap computers,
> with say 4 disks each run as raid0, raid10 or no raid at all, export their
> storage via glusterfs to our web server. Gluster's replication would serve as
> a kind of fail-safe net, and data redistribution would help when we add more
> similar machines later on to counter increased usage.

I am currently building a similar test rig to yours, but with 24 disk bays per 4U server. There are two LSI HBAs, one 16-port and one 8-port. The HBAs are not the bottleneck (I can dd data to and from all the disks at once, no problem), and the CPUs are never very busy. One box has an i3-2130 3.4GHz processor (dual core, hyperthreaded), and the other a Xeon E3-1225 3.1GHz (quad core, no hyperthreading).

We're going this way because we need tons of storage packed into a rack in a constrained power budget, but you might also find that fewer big servers are better than lots of tiny ones. I'd consider at least 2U with 12 hot-swap bays.
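Coming back to point 3: for reference, a brick mount entry with those options would look something like this (device and mount point are just placeholders, adjust to taste):

    /dev/md/raid10   /data/brick1   xfs   inode64,noatime,nodiratime   0 0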
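And regarding point 5, testing directly on the mount point: my actual test harness is a ruby script, but a quick-and-dirty shell equivalent of the idea would be something along these lines (mount point and file list are again placeholders):

    # read 1000 random files from the gluster mount with 20 readers in parallel
    sync; echo 3 > /proc/sys/vm/drop_caches   # drop caches first (see link below)
    time shuf -n 1000 filelist.txt | \
        xargs -P 20 -I{} dd if=/mnt/gluster/{} of=/dev/null bs=1M 2>/dev/null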
I have yet to finish my testing, but here are two relevant results:

(1) With a single 12-disk RAID10 array with 1MB chunk size, shared using glusterfs over 10GE to another machine and serving files of between 500kB and 800kB, from the client I can read 180 random files per second (117MB/s) with 20 concurrent processes, or 206 random files per second (134MB/s) with 30 concurrent processes. For comparison, direct local access to the filesystem on the RAID10 array gives 291 files/sec (189MB/sec) and 337 files/sec (219MB/sec) with 20 or 30 concurrent readers. However, the gluster performance at 1/2/5 concurrent readers tracks the direct RAID10 closely and only falls off above that, so I think there may be some gluster concurrency tuning required.

(2) In another configuration, I have 6 disks in one server and 6 in the other, with twelve separate XFS filesystems combined into a distributed-replicated array (much like yours but with half the spindles). The gluster volume is mounted on one of the servers, which is where I run the test, so 6 disks are local and 6 are remote. Serving the same corpus of files, I can read 177 random files per second (115MB/s) with 20 concurrent readers, or 198 files/sec (129MB/s) with 30 concurrent readers.

The corpus is 100k files, so about 65GB in total, and the machines have 8GB RAM. Each test drops caches first: http://linux-mm.org/Drop_Caches

I have no web server layer in front of this - I'm using a ruby script which forks and fires off 'dd' processes to read the files from the gluster mountpoint. However, I am using low-performance 5940 RPM drives (Hitachi Deskstar 5K3000 HDS5C3030ALA630) because they are cheap, use little power, and are reputedly very reliable. If you're using anything better than these, you should be able to improve on my numbers.

I haven't compared to NFS, which might be an option for you if you can live without the node-to-node replication features of glusterfs.

Regards,

Brian.