Dan Bretherton
2013-Jan-31 15:02 UTC
[Gluster-users] Extremely high CPU load on one server
Dear All,

I originally asked this question in another thread, but as it is a separate problem I thought it deserved its own thread.

I had to extend two volumes recently, and the layout fixes for both of them are now running at the same time. One server now has a load of over 70 most of the time (mostly glusterfsd), but none of the others seem to be particularly busy. I restarted the server in question, but the CPU load quickly went up to 70 again. I can't see any particular reason why this one server should be so badly affected by the layout fixing processes; it isn't a particularly big server, with only five 3TB bricks involved in the two volumes that were extended.

One possibility is that a lot of batch jobs on our compute cluster are accessing the same set of files, which just happen to be on this one server. That could plausibly happen because I have not been able to rebalance the files in the storage volumes for a long time; the last attempt (under GlusterFS version 3.2) resulted in data corruption. However, poor file distribution certainly isn't the whole story, because the storage server's load remains extremely high even when there is not much running on the compute cluster.

Can anyone suggest a way to troubleshoot this problem? The rebalance logs don't show anything unusual, but glustershd.log has a lot of metadata split-brain warnings. The brick logs are full of scary looking warnings, but none flagged 'E' or 'C'. The trouble is that I see messages like these on all the servers, and I can find nothing unusual about the server with a CPU load of 70. Users are complaining about very poor performance, which has been going on for several weeks, so I must at least find a work-around that allows people to work normally.

-Dan

--
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre (ESSC)
Department of Meteorology
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 7BE (or RG6 6AL for postal service deliveries)
UK
Tel. +44 118 378 5205, Fax: +44 118 378 6413
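P.S. For reference, this is roughly how I have been working out which bricks the busy glusterfsd processes serve; a quick sketch, assuming (as far as I can tell from our 3.x servers) that the volume and brick are identifiable from each daemon's --volfile-id argument:

    # Show the five busiest glusterfsd processes with their full command
    # lines; the --volfile-id argument identifies the volume and brick.
    ps -C glusterfsd -o pid=,pcpu=,args= | sort -k2 -rn | head -5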
Adrià García-Alzórriz
2013-Jan-31 15:28 UTC
[Gluster-users] Extremely high CPU load on one server
On Thu, 31 Jan 2013 15:02:25 +0000, Dan Bretherton wrote:

> Can anyone suggest a way to troubleshoot this problem? The rebalance
> logs don't show anything unusual, but glustershd.log has a lot of
> metadata split-brain warnings. The brick logs are full of scary
> looking warnings, but none flagged 'E' or 'C'. The trouble is that I see
> messages like these on all the servers, and I can find nothing unusual
> about the server with a CPU load of 70. Users are complaining about
> very poor performance, which has been going on for several weeks, so I
> must at least find a work-around that allows people to work normally.
>
> -Dan

What's the output of your gluster volume info command?
What kind of disks are you using?
What does the I/O wait figure in top look like?
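For example, something along these lines (iostat comes with the sysstat package on most distributions):

    # Volume topology and reconfigured options
    gluster volume info

    # Extended per-device statistics every 5 seconds; watch %util and await
    iostat -x 5

    # In top, the 'wa' figure on the Cpu(s) line is the I/O wait percentage
    top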
Dan Bretherton
2013-Jan-31 18:46 UTC
[Gluster-users] Extremely high CPU load on one server
On 01/31/2013 05:37 PM, gluster-users-request at gluster.org wrote:
> Date: Thu, 31 Jan 2013 15:28:25 +0000 (UTC)
> From: Adrià García-Alzórriz <adria.garcia-alzorriz at adam.es>
> To: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Extremely high CPU load on one server
> Message-ID: <kee2ip$osk$1 at ger.gmane.org>
> Content-Type: text/plain; charset="us-ascii"
>
> On Thu, 31 Jan 2013 15:02:25 +0000, Dan Bretherton wrote:
>> Can anyone suggest a way to troubleshoot this problem? The rebalance
>> logs don't show anything unusual, but glustershd.log has a lot of
>> metadata split-brain warnings. The brick logs are full of scary
>> looking warnings, but none flagged 'E' or 'C'. The trouble is that I see
>> messages like these on all the servers, and I can find nothing unusual
>> about the server with a CPU load of 70. Users are complaining about
>> very poor performance, which has been going on for several weeks, so I
>> must at least find a work-around that allows people to work normally.
>>
>> -Dan
> What's the output of your gluster volume info command?
> What kind of disks are you using?
> What does the I/O wait figure in top look like?

Hello- Here is the information you requested; thanks.

> What's the output of your gluster volume info command?

There are ten volumes in the cluster, and the one I'm having the most trouble with is shown below.

[root at bdan11 ~]# gluster volume info atmos

Volume Name: atmos
Type: Distributed-Replicate
Volume ID: a4a7774e-cf4d-4a3e-9477-5c3d1b0efd93
Status: Started
Number of Bricks: 16 x 2 = 32
Transport-type: tcp
Bricks:
Brick1: bdan0.nerc-essc.ac.uk:/local/glusterfs
Brick2: bdan1.nerc-essc.ac.uk:/local/glusterfs
Brick3: bdan2.nerc-essc.ac.uk:/local/glusterfs
Brick4: bdan3.nerc-essc.ac.uk:/local/glusterfs
Brick5: bdan4.nerc-essc.ac.uk:/atmos/glusterfs
Brick6: bdan5.nerc-essc.ac.uk:/atmos/glusterfs
Brick7: bdan6.nerc-essc.ac.uk:/local/glusterfs
Brick8: bdan7.nerc-essc.ac.uk:/local/glusterfs
Brick9: bdan8.nerc-essc.ac.uk:/atmos/glusterfs
Brick10: bdan9.nerc-essc.ac.uk:/atmos/glusterfs
Brick11: bdan10.nerc-essc.ac.uk:/atmos/glusterfs
Brick12: bdan11.nerc-essc.ac.uk:/atmos/glusterfs
Brick13: bdan12.nerc-essc.ac.uk:/atmos/glusterfs
Brick14: bdan13.nerc-essc.ac.uk:/atmos/glusterfs
Brick15: bdan10.nerc-essc.ac.uk:/atmos2/glusterfs
Brick16: bdan11.nerc-essc.ac.uk:/atmos2/glusterfs
Brick17: bdan12.nerc-essc.ac.uk:/atmos2/glusterfs
Brick18: bdan13.nerc-essc.ac.uk:/atmos2/glusterfs
Brick19: bdan4.nerc-essc.ac.uk:/atmos2/glusterfs
Brick20: bdan5.nerc-essc.ac.uk:/atmos2/glusterfs
Brick21: bdan12.nerc-essc.ac.uk:/atmos3/glusterfs
Brick22: bdan13.nerc-essc.ac.uk:/atmos3/glusterfs
Brick23: bdan12.nerc-essc.ac.uk:/atmos4/glusterfs
Brick24: bdan13.nerc-essc.ac.uk:/atmos4/glusterfs
Brick25: bdan12.nerc-essc.ac.uk:/atmos5/glusterfs
Brick26: bdan13.nerc-essc.ac.uk:/atmos5/glusterfs
Brick27: pegasus.nerc-essc.ac.uk:/atmos/glusterfs
Brick28: bdan14.nerc-essc.ac.uk:/atmos/glusterfs
Brick29: bdan15.nerc-essc.ac.uk:/atmos/glusterfs
Brick30: bdan16.nerc-essc.ac.uk:/atmos/glusterfs
Brick31: bdan15.nerc-essc.ac.uk:/atmos2/glusterfs
Brick32: bdan16.nerc-essc.ac.uk:/atmos2/glusterfs
Options Reconfigured:
nfs.enable-ino32: on
server.allow-insecure: on
performance.stat-prefetch: off
performance.quick-read: off
cluster.min-free-disk: 338GB
nfs.rpc-auth-allow: 192.171.166.*,134.225.100.*
features.quota: off

> What kind of disks are you using?

They are Hitachi Deskstar 7200 RPM 2TB SATA drives, attached to an Areca ARC1880 RAID controller.
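Incidentally, a one-liner along these lines is how I checked which volumes have bricks on a given server (bdan11 is just used as the example hostname here):

    # Print "volume brick" for every brick line mentioning this host
    gluster volume info | awk -v h="bdan11" '/^Volume Name:/{v=$3} $0 ~ h {print v, $0}'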
I'm aware that it's an unusual choice of drive, but the server vendor swears by them. I have another pair of servers from the same vendor with 1TB Hitachi Deskstars, and they have not given any GlusterFS related trouble. The bricks are logical volumes on top of RAID-6 provided by the Areca. On this particular server they are formatted ext4, but I have started using XFS for more recently added bricks on newer servers. The bricks vary in size from 3.3TB to 500GB, depending on which volume they belong to; I use smaller bricks for smaller volumes so that the brick size is an appropriate increment of volume growth.

> What does the I/O wait figure in top look like?

There isn't much I/O wait as far as I can gather. Here is a view showing the top few processes.

top - 18:22:58 up 1:59, 1 user, load average: 64.25, 64.36, 64.14
Tasks: 218 total, 9 running, 209 sleeping, 0 stopped, 0 zombie
Cpu(s): 78.2%us, 20.7%sy, 0.0%ni, 0.9%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 12290244k total, 6639128k used, 5651116k free, 2752944k buffers
Swap: 14352376k total, 0k used, 14352376k free, 1441712k cached

 PID  USER  PR  NI  VIRT   RES  SHR   S %CPU %MEM   TIME+    COMMAND
 3661 root  15  0   289m   42m  2168  R 85.0  0.4  101:54.33 glusterfsd
 3697 root  15  0   304m   37m  2112  S 83.1  0.3   75:53.43 glusterfsd
 3655 root  15  0   289m   42m  2160  S 79.4  0.4   97:11.52 glusterfsd
 3685 root  15  0   550m   62m  2100  S 77.5  0.5  109:07.53 glusterfsd
 3691 root  15  0   483m   40m  2100  S 75.6  0.3   76:10.93 glusterfsd
 3679 root  15  0   272m   20m  2124  R  1.9  0.2    4:41.69 glusterfsd
    1 root  15  0  10368   692   580  S  0.0  0.0    0:00.74 init

It's having a breather; the load has gone down to 64! You can see more load information on my Ganglia web frontend, here:

http://lovejoy.nerc-essc.ac.uk/ganglia/?c=ESSC%20Storage%20Cluster

The five glusterfsd processes using the most resources, listed in top above, serve bricks belonging to two different volumes, so I can't blame just one user or group for the load; each research group in the department has its own volume. That probably disproves my theory that poorly distributed files are the cause of the problem. There are bricks belonging to six different volumes on this server.

One thing I do notice about these five glusterfsd processes is that they all belong to the two volumes that are having rebalance...fix-layout performed on them following the recent addition of new bricks. The new bricks are on a different pair of servers. This does seem to point the finger at the fix-layout as the cause of the high load, but why would only one server be affected in this way?

-Dan
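P.S. For reference, these are the commands I have been using to keep an eye on the layout fixes and the split-brain warnings, using atmos as the example volume; this assumes a 3.3-series installation (which the presence of glustershd suggests), where both subcommands should be available:

    # Progress of the fix-layout / rebalance operation on an extended volume
    gluster volume rebalance atmos status

    # Entries the self-heal daemon currently considers split-brain
    gluster volume heal atmos info split-brain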