Matt
2015-Feb-03 05:46 UTC
[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins
Hello List,

So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 GB of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and vmstat will show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops: overall usage on the box goes from 6-10 GB to right around 4 GB, and stays there. That's what really puzzles me.

When performance is problematic, sar shows one device, the device backing the busy glusterfsd process, using all the CPU doing lots of little reads: sometimes 70k/second, with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. Frankly I have tons of information, but am trying to keep this reasonably brief.

There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.

The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.

At this point I figure, broadly, one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with Gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.

If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.

Apologies for the somewhat broad nature of the question; any thoughts or input would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.

Thanks,

-Matt
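For reference, here is a minimal sketch of a capture loop that could be left running the next time the problem flares up, combining the system-level stats mentioned above with per-volume Gluster stats. The volume name "uploads", the output directory, the list-cnt value, and the one-minute interval are all placeholders, not taken from the thread:

    #!/bin/bash
    # Snapshot system and per-volume Gluster stats once a minute for 30 minutes.
    VOL=uploads                                   # placeholder volume name
    OUT=/var/tmp/gluster-diag.$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"
    for i in $(seq 1 30); do
        ts=$(date +%H%M%S)
        vmstat 1 5 > "$OUT/vmstat.$ts"            # context switches / interrupts
        sar -d 1 5 > "$OUT/sar-disk.$ts"          # per-device tps and avg request size
        gluster volume top "$VOL" read list-cnt 10 > "$OUT/top-read.$ts"   # hottest files by read count
        gluster volume top "$VOL" open list-cnt 10 > "$OUT/top-open.$ts"   # hottest files by open count
        gluster volume status "$VOL" clients       > "$OUT/clients.$ts"    # per-brick client list, bytes in/out
        sleep 60
    done

On the "which client is doing what" question: diffing successive clients.* snapshots shows which client connections' read/write byte counters are climbing fastest during an episode.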
Justin Clift
2015-Feb-03 10:58 UTC
[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins
----- Original Message -----
> [snip]

Out of curiosity, are you able to test using GlusterFS 3.6.2? We've done a bunch of pretty in-depth upstream testing at decent scale (100+ nodes) from 3.5.x onwards, with lots of performance issues identified and fixed along the way. So I'm kinda hopeful the problem you're describing is fixed in newer releases. :D

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
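As a quick sanity check before and after any upgrade test, something like the following confirms what each node is actually running (the glusterd.info path is typical for these releases, but may vary by distro, so treat it as an assumption):

    # On each server and client:
    glusterfs --version | head -1
    # Cluster operating version, if present (path may vary by distro):
    grep operating-version /var/lib/glusterd/glusterd.info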
Pranith Kumar Karampuri
2015-Feb-05 11:14 UTC
[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins
On 02/03/2015 11:16 AM, Matt wrote:
> [snip]

Could you enable profiling with 'gluster volume profile <volname> start' for this volume? The next time this issue happens, keep collecting the output of 'gluster volume profile <volname> info'. Mail those in and let's see what is happening.

Pranith
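A rough version of the collection loop Pranith describes, assuming the placeholder volume name "uploads"; the one-minute interval and output path are illustrative only:

    VOL=uploads                                   # placeholder volume name
    gluster volume profile "$VOL" start
    # While the problem is live, snapshot per-FOP latency and call counts:
    while true; do
        gluster volume profile "$VOL" info > "/var/tmp/profile-$(date +%Y%m%d-%H%M%S).txt"
        sleep 60
    done
    # When finished (profiling adds some overhead):
    gluster volume profile "$VOL" stop

The 'info' output breaks the load down per file operation (LOOKUP, READ, WRITE, and so on) with latency and call counts, which should show whether the busy brick is doing small reads, directory crawls, or something else entirely.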