thr3ads.net - Gluster users - [Gluster-users] GlusterFS performance, concurrency and I/O blocking [Aug 2011]

Ken Randall

2011-Aug-23 18:17 UTC

[Gluster-users] GlusterFS performance, concurrency and I/O blocking

Hi everybody! Love this community, and I love GlusterFS.

All that, despite being burned by it, likely due to my own failures. Here's
the scenario where I got burned, and my guesstimates on why it happened.

We run a popular .NET-based web app that gets a lot of traffic, where people
build websites using our system. The long and short of it is, we tested,
tweaked, and tested some more, over a full month. After we deployed it to
production, we saw performance take a dive into the dumpster. We had to
revert fairly quickly.

The obvious blame is in our testing. We load tested the system many, many
times over the course of an entire month, but with a narrow range of test
scenarios. The wide range of live production traffic rendered our testing
moot. We tucked our tail between our legs and are researching tools
that will let us play back live traffic to serve as a better simulation. In
our earlier load testing we were able to achieve many multiples of our peak
traffic, but again it wasn't realistic traffic.

Before I get to my suspicion of what's happening, keep in mind that we have
50+ million files (spread over hundreds of thousands of directories), most of
them small, and each page request will pull in anywhere from 10 to 40 supporting
assets (images, Flash files, CSS, JS, etc.). We also have people executing
directory listings whenever they're editing their site, as they choose
images, etc. to insert onto the page. We're also exporting the volume to
CIFS so our Windows servers can access the GlusterFS client on the Linux
machines in the cluster. The Samba settings on there were tweaked to the
hilt as well, turning off case-insensitivity, bumping up caches and async
IO, etc.

It appears as if GlusterFS has some kind of I/O blocking going on. Whenever
a directory listing is being pieced together, it noticeably slows down (or
stops?) other operations through the same client. For a high-concurrency
app like ours where the storage backend needs to be able to pull off 10 to
100 directory listings a second, and 5,000 to 10,000 IOPS overall, it's easy
to see how perf would degrade if my blocking suspicion is correct. The
biggest culprit, in my guess, is the directory listing. Executing one makes
things drag. I've been able to demonstrate that through a simple script.
And we're running some pretty monster machines with 24 cores, 24 GB RAM,
etc.
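
For illustration, here is a minimal sketch of the kind of script that could show this effect. It is not the script Ken used, which is not included in the thread. It assumes a GlusterFS client mount is reachable as an ordinary directory; the function names `time_stats`, `probe`, and `summarize`, and whatever paths you pass in, are hypothetical. One thread repeatedly calls stat() on a few small files and records each latency, first with nothing else running and then while the main thread lists a large directory.

```python
import os
import statistics
import threading
import time


def time_stats(paths, stop_event, samples):
    """Repeatedly stat() the given files, recording each call's latency in seconds."""
    while not stop_event.is_set():
        for path in paths:
            start = time.perf_counter()
            os.stat(path)
            samples.append(time.perf_counter() - start)
            if stop_event.is_set():
                break


def probe(listing_dir, probe_paths, baseline_seconds=1.0):
    """Measure stat() latency before and during a listing of listing_dir."""
    # Phase 1: baseline latency with no directory listing in flight.
    baseline, stop = [], threading.Event()
    worker = threading.Thread(target=time_stats, args=(probe_paths, stop, baseline))
    worker.start()
    time.sleep(baseline_seconds)
    stop.set()
    worker.join()

    # Phase 2: the same probe while this thread walks the directory.
    during, stop = [], threading.Event()
    worker = threading.Thread(target=time_stats, args=(probe_paths, stop, during))
    worker.start()
    entries = sum(1 for _ in os.scandir(listing_dir))
    stop.set()
    worker.join()
    return entries, baseline, during


def summarize(samples):
    """Return (median_ms, max_ms), or None if no samples were collected."""
    if not samples:
        return None
    return statistics.median(samples) * 1000, max(samples) * 1000
```

To use it, call `probe()` with a large directory on the mount and a handful of small files from a different directory, then compare `summarize(baseline)` with `summarize(during)`. A much higher median or maximum during the listing would be consistent with the blocking suspicion, although a test like this cannot tell whether the stall comes from the GlusterFS client, the network, or the disks on the bricks.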

I tried as many tuning permutations as possible, only to run into the same
result. Jacking up the cache-size, setting the io-thread-count to 64, etc.
certainly helped performance, but the blocking behavior persisted. I also
made sure that each web server accessing the GlusterFS backend was talking
to its own GlusterFS client, in the hopes of increasing parallelization.
I'm sure it helped, but not enough. It's nowhere close to the concurrency
and performance of a straight-out Windows share. (I realize a clustered
file system has overhead and will perform worse than a straight share, but
we saw performance drop by roughly an order of magnitude as load
increased.)
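
As a sketch of what that tuning can look like, assuming the volume is managed through the gluster CLI (the post does not say whether the CLI or hand-edited volume files were used): the helper below only builds and prints `gluster volume set` commands, it does not run them. The volume name `webassets` and the cache size of `1GB` are placeholders, since the post gives no values for them; only the io-thread-count of 64 comes from the post.

```python
import shlex

# Option names are the volume options behind "cache-size" and "io-thread-count".
# Values: 64 comes from the post; "1GB" is a placeholder, not a recommendation.
TUNING = {
    "performance.cache-size": "1GB",
    "performance.io-thread-count": 64,
}


def volume_set_commands(volume, options):
    """Build one 'gluster volume set' argument list per option."""
    return [
        ["gluster", "volume", "set", volume, key, str(value)]
        for key, value in options.items()
    ]


if __name__ == "__main__":
    # "webassets" is a hypothetical volume name.
    for argv in volume_set_commands("webassets", TUNING):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first makes it easy to review each permutation before applying it, and to record which combination was active during a given load test.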

Am I way off? Does GlusterFS block on directory listings (getdents) or any
other operations? If so, is there a way to enable the database equivalent
of "dirty reads" so it doesn't block?

Ken

Joe Landman

2011-Aug-23 18:29 UTC

On 08/23/2011 02:17 PM, Ken Randall wrote:
> Hi everybody!  Love this community, and I love GlusterFS.
[...]
> Before I get to my suspicion of what's happening, keep in mind that we
> have 50+ million files (over hundreds of thousands of directories), most
> of them are small, and each page request will pull in upwards of 10-40
> supporting assets (images, Flash files, CSS, JS, etc.).  We also have
> people executing directory listings whenever they're editing their site,
> as they choose images, etc. to insert onto the page.  We're also
> exporting the volume to CIFS so our Windows servers can access the
> GlusterFS client on the Linux machines in the cluster.  The Samba
> settings on there were tweaked to the hilt as well, turning off
> case-insensitivity, bumping up caches and async IO, etc.
>
> It appears as if GlusterFS has some kind of I/O blocking going on.
> Whenever a directory listing is being pieced together, it noticeably
> slows down (or stops?) other operations through the same client.  For a
> high-concurrency app like ours where the storage backend needs to be
> able to pull off 10 to 100 directory listings a second, and 5,000 to
> 10,000 IOPS overall, it's easy to see how perf would degrade if my
> blocking suspicion is correct.  The biggest culprit, in my guess, is the
> directory listing.  Executing one makes things drag.  I've been able to
> demonstrate that through a simple script.  And we're running some pretty
> monster machines with 24 cores, 24 GB RAM, etc.
[...]
> Am I way off?  Does GlusterFS block on directory listings (getdents) or
> any other operations?  If so, is there a way to enable the database
> equivalent of "dirty reads" so it doesn't block?

What was the back end file system and how did you construct it?

Many folks use the ext* series based upon recommendations from the 
company.  ext* is highly contra-indicated for highly parallel 
operations.  It simply cannot keep up with file systems designed for this.

Another area is the extended attributes.  If you size the system or 
the extended attributes wrong, you will lose performance.

And (note we are biased given what we build/sell), hardware matters. 
The vast majority of hardware we've seen people stand up themselves for 
this, has been badly underpowered for massive scale IOs.  It usually 
starts with someone spec'ing a 6G backplane, and rapidly goes south from 
there.  It doesn't matter how fast the software layers are, if the 
hardware simply cannot keep up.  In the vast majority of cases where 
we've been called in to look at someone's setup, yeah, the hardware 
played a huge role in the (lack of) performance.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

Mohit Anchlia

2011-Aug-23 18:36 UTC

Directory listings over 50 million files will definitely slow things
down if the environment is not set up to handle that kind of IO. I am
thinking the blocking could be due to network throttling, or to disk seeks
while random IO is happening. Were you able to look at the network
utilization during this time period? You can install iftop to monitor
network utilization. Also, there is a bug related to small files that
causes performance issues.

Bug - http://bugs.gluster.com/show_bug.cgi?id=2869

Definitely do some benchmarking and work out your IO requirements. This will
help you tune your system. But in general directory listings are slow
and consume network bandwidth too. In addition, watch iostat and look at
await and %util.
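
For reference, a rough sketch of where those two figures come from, assuming a Linux brick server with the standard /proc/diskstats counters. This is a simplified approximation of what iostat reports, not iostat's own code, and the names `parse_diskstats_line`, `read_sample`, and `await_and_util` are made up for this example. await is the average time per completed request in milliseconds, and %util is the share of the interval during which the device had IO in flight.

```python
import time
from collections import namedtuple

# Cumulative counters for one device: completed IOs, ms spent waiting, ms busy.
DiskSample = namedtuple("DiskSample", "ios ms_waiting ms_busy")


def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into (device_name, DiskSample)."""
    fields = line.split()
    # fields[0:3] are major, minor, device name; the counters follow.
    reads, ms_reading = int(fields[3]), int(fields[6])
    writes, ms_writing = int(fields[7]), int(fields[10])
    ms_busy = int(fields[12])
    return fields[2], DiskSample(reads + writes, ms_reading + ms_writing, ms_busy)


def read_sample(device):
    """Return the current DiskSample for a device such as 'sda'."""
    with open("/proc/diskstats") as stats:
        for line in stats:
            name, sample = parse_diskstats_line(line)
            if name == device:
                return sample
    raise ValueError("device not found: " + device)


def await_and_util(before, after, interval_ms):
    """Compute (await in ms, %util) from two samples taken interval_ms apart."""
    ios = after.ios - before.ios
    await_ms = (after.ms_waiting - before.ms_waiting) / ios if ios else 0.0
    util_pct = 100.0 * (after.ms_busy - before.ms_busy) / interval_ms
    return await_ms, util_pct


if __name__ == "__main__":
    device = "sda"  # hypothetical device name; use the disk backing the brick
    first = read_sample(device)
    time.sleep(5)
    second = read_sample(device)
    print(await_and_util(first, second, 5000))
```

A %util that stays near 100 while await climbs during directory listings would point at the disks on the bricks rather than at the network or the GlusterFS client.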


On Tue, Aug 23, 2011 at 11:17 AM, Ken Randall <rushian85 at gmail.com> wrote:
> Hi everybody! Love this community, and I love GlusterFS.
[...]
