Harry Mangalam
2012-Aug-10 21:16 UTC
[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
Running GlusterFS 3.3, distributed, over IPoIB on 4 nodes, 1 brick per node. Any idea why, on one of those nodes, glusterfsd would go berserk, running up to 370% CPU and driving the load to >30? File performance on the clients slows to a crawl, though the node continued to serve out files, just very slowly. This is the second time this has happened in about a week.

I had turned on the gluster NFS services, but wasn't using them when this happened; they're now off. kill -HUP did nothing to either glusterd or glusterfsd, so I had to kill both and restart glusterd. That cleared the overload on glusterfsd and performance is back to near normal. I'm now doing a rebalance/fix-layout, which is running as expected but will take the weekend to complete.

I did notice that the affected node (pbs3) has more files than the others, though I'm not sure that this is significant:

Filesystem       Size  Used  Avail  Use%  Mounted on
pbs1:/dev/sdb    6.4T  1.9T  4.6T   29%   /bducgl
pbs2:/dev/md0    8.2T  2.4T  5.9T   30%   /bducgl
pbs3:/dev/md127  8.2T  5.9T  2.3T   73%   /bducgl  <---
pbs4:/dev/sda    6.4T  1.8T  4.6T   29%   /bducgl

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697  Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
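[Editor's note: a minimal sketch of the recovery and fix-layout steps described above, assuming the volume is named "bducgl" (hypothetical; substitute your own volume name). Run on the affected server node.]

# HUP had no effect, so stop both daemons outright:
pkill glusterfsd
pkill glusterd

# Restart the management daemon; it respawns the brick process (glusterfsd):
service glusterd start        # or: /etc/init.d/glusterd start

# Recalculate the directory layout across the bricks, then watch progress:
gluster volume rebalance bducgl fix-layout start
gluster volume rebalance bducgl status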
Nux!
2012-Aug-11 11:11 UTC
[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
On 10.08.2012 22:16, Harry Mangalam wrote:
> pbs3:/dev/md127  8.2T  5.9T  2.3T  73%  /bducgl  <---

Harry,

The name of that md device (127) indicates there may be something dodgy going on there. A device shouldn't be named 127 unless some problems have occurred. Are you sure your drives are OK?

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro
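[Editor's note: a quick way to act on Nux's suggestion and check the md array and member drives on pbs3; the md device name comes from the df listing above, the member-disk glob is a placeholder to adjust for your layout.]

cat /proc/mdstat              # array state, degraded/rebuild status
mdadm --detail /dev/md127     # per-member status, failed/spare counts

# SMART health of each member disk (requires smartmontools):
for d in /dev/sd?; do
    smartctl -H "$d"
done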
Joe Julian
2012-Aug-11 16:56 UTC
[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
Check your client logs. I have seen that with network issues causing disconnects.

Harry Mangalam <hjmangalam at gmail.com> wrote:

> Thanks for your comments.
>
> I use mdadm on many servers and I've seen md numbering like this a fair
> bit. Usually it occurs after another RAID has been created and the
> numbering shifts. Neil Brown (mdadm's author) seems to think it's fine,
> so I don't think that's the problem. And you're right - this is a
> Frankengluster made from a variety of chassis and controllers, and normally
> it's fine. As Brian noted, it's all the same to gluster, modulo some small
> local differences in IO performance.
>
> Re the size difference, I'll explicitly rebalance the brick after the
> fix-layout finishes, but I'm even more worried about this fantastic
> increase in CPU usage and its effect on user performance.
>
> During the fix-layout (still running), I've seen CPU usage of glusterfsd
> rise to ~400% and the loadavg go above 15 on all the servers (except pbs3,
> the one that originally had the problem). That high load does not last
> long, though (maybe a few minutes). We've just installed Nagios on these
> nodes, and I'm getting a ton of emails about load increasing and then
> decreasing on all the nodes (except pbs3). When the load goes very high
> on a server node, the user-end performance drops appreciably.
>
> hjm
>
> On Sat, Aug 11, 2012 at 4:20 AM, Brian Candler <B.Candler at pobox.com> wrote:
>
>> On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote:
>> > On 10.08.2012 22:16, Harry Mangalam wrote:
>> > > pbs3:/dev/md127 8.2T 5.9T 2.3T 73% /bducgl <---
>> >
>> > Harry,
>> >
>> > The name of that md device (127) indicates there may be something
>> > dodgy going on there. A device shouldn't be named 127 unless some
>> > problems occurred. Are you sure your drives are OK?
>>
>> I have systems with /dev/md127 all the time, and there's no problem. It
>> seems to number downwards from /dev/md127 - if I create another md array
>> on the same system it is /dev/md126.
>>
>> However, this does suggest that the nodes are not configured identically:
>> two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware
>> RAID, while two are /dev/md0 or /dev/md127, which is software RAID.
>>
>> Although this could explain performance differences between the nodes, it
>> is transparent to gluster and doesn't explain why the files are unevenly
>> balanced - unless there is one huge file which happens to have been
>> allocated to this node.
>>
>> Regards,
>>
>> Brian.
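[Editor's note: a hedged sketch of Joe's "check your client logs" suggestion. The FUSE client log normally lives under /var/log/glusterfs/ and is named after the mount point, but the exact file name and message wording vary by version, so treat both as assumptions.]

LOG=/var/log/glusterfs/bducgl.log    # hypothetical name for a /bducgl mount

# Look for disconnect/reconnect churn or ping timeouts on the client side:
grep -iE 'disconnect|connection.*(refused|reset)|ping timer expired' "$LOG" | tail -n 50

# Pair that with brick-side health on each server:
gluster volume status bducgl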