It is doing it again... statedump from gfs02a is attached...

------ Original Message ------
From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
To: "Glomski, Patrick" <patrick.glomski at corvidtec.com>
Cc: "David Robinson" <drobinson at corvidtec.com>; "gluster-users at gluster.org" <gluster-users at gluster.org>; "Gluster Devel" <gluster-devel at gluster.org>
Sent: 1/24/2016 9:27:02 PM
Subject: Re: [Gluster-users] [Gluster-devel] heal hanging

>It seems like there is a lot of finodelk/inodelk traffic. I wonder why that is. I think the next step is to collect a statedump of the brick which is taking a lot of CPU, using "gluster volume statedump <volname>".
>
>Pranith
>On 01/22/2016 08:36 AM, Glomski, Patrick wrote:
>>Pranith, attached are stack traces collected every second for 20 seconds from the high-%cpu glusterfsd process.
>>
>>Patrick
>>
>>On Thu, Jan 21, 2016 at 9:46 PM, Glomski, Patrick <patrick.glomski at corvidtec.com> wrote:
>>>Last entry for get_real_filename on any of the bricks was when we turned off the samba gfapi vfs plugin earlier today:
>>>
>>>/var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21 15:13:00.008239] E [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 105: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)
>>>
>>>We'll get back to you with those traces when %cpu spikes again. As with most sporadic problems, as soon as you want something out of it, the issue becomes harder to reproduce.
>>>
>>>On Thu, Jan 21, 2016 at 9:21 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>
>>>>On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
>>>>>Unfortunately, all samba mounts to the gluster volume through the gfapi vfs plugin have been disabled for the last 6 hours or so, and the frequency of %cpu spikes has increased.
>>>>>We had switched to sharing a fuse mount through samba, but I just disabled that as well. There are no samba shares of this volume now. The spikes now happen every thirty minutes or so. We've resorted to just rebooting the machine with high load for the present.
>>>>
>>>>Could you see if the logs of the following type are no longer coming?
>>>>[2016-01-21 15:13:00.005736] E [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 110: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)
>>>>
>>>>These are operations that failed. Operations that succeed are the ones that will scan the directory, but I don't have a way to find them other than using tcpdumps.
>>>>
>>>>At the moment I have 2 theories:
>>>>1) these get_real_filename calls
>>>>2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve] 0-gid-cache: getpwuid_r(494) failed
>>>>"
>>>>Yessir they are. Normally, sssd would look to the local cache file in /var/lib/sss/db/ first to get any group or userid information, then go out to the domain controller. I put the options that we are using on our GFS volumes below. Thanks for your help.
>>>>
>>>>We had been running sssd with sssd_nss and sssd_be sub-processes on these systems for a long time, under the GFS 3.5.2 code, and had not run into the problem that David described with the high cpu usage on sssd_nss.
>>>>"
>>>>That was Tom Young's email 1.5 years back when we debugged it. But the process which was consuming a lot of cpu then was sssd_nss, so I am not sure if it is the same issue. Let us debug to confirm '1)' doesn't happen. The gstack traces I asked for should also help.
>>>>
>>>>Pranith
>>>>>
>>>>>On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>>>
>>>>>>On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
>>>>>>>We use the samba glusterfs virtual filesystem (the current version provided on download.gluster.org), but no windows clients connecting directly.
>>>>>>
>>>>>>Hmm.. Is there a way to disable this and check if the CPU% still increases? What getxattr of "glusterfs.get_real_filename <filename>" does is scan the entire directory looking for strcasecmp(<filename>, <scanned-filename>). If anything matches, it returns the <scanned-filename>. But the problem is that the scan is costly, so I wonder if this is the reason for the CPU spikes.
>>>>>>
>>>>>>Pranith
>>>>>>>
>>>>>>>On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>>>>>Do you have any windows clients? I see a lot of getxattr calls for "glusterfs.get_real_filename" which lead to full readdirs of the directories on the brick.
>>>>>>>>
>>>>>>>>Pranith
>>>>>>>>
>>>>>>>>On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
>>>>>>>>>Pranith, could this kind of behavior be self-inflicted by us deleting files directly from the bricks? We have done that in the past to clean up issues where gluster wouldn't allow us to delete from the mount.
>>>>>>>>>
>>>>>>>>>If so, is it feasible to clean them up by running a search on the .glusterfs directories directly and removing files with a reference count of 1 that are non-zero size (or directly checking the xattrs to be sure that it's not a DHT link)?
>>>>>>>>>
>>>>>>>>>find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -exec rm -f "{}" \;
>>>>>>>>>
>>>>>>>>>Is there anything I'm inherently missing with that approach that will further corrupt the system?
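The fuller check Patrick describes (link count plus a DHT-link xattr test, rather than link count alone) can be sketched as below. This is only a sketch: the helper names are invented for illustration, and the `trusted.glusterfs.dht.linkto` xattr name and the exact semantics should be verified against the running Gluster version before anything is deleted. The function deliberately only reports candidates.

```python
import errno
import os
import stat

# xattr carried by DHT link files; reading trusted.* xattrs generally
# requires CAP_SYS_ADMIN, so run this as root on the brick.
DHT_LINKTO = "trusted.glusterfs.dht.linkto"

def is_dht_linkfile(path):
    """True if the file carries the DHT linkto xattr. A missing xattr,
    an xattr-less filesystem, or lack of privilege is treated as 'not
    a link file'."""
    try:
        os.getxattr(path, DHT_LINKTO)
        return True
    except OSError as e:
        if e.errno in (errno.ENODATA, errno.ENOTSUP, errno.EPERM):
            return False
        raise

def orphan_candidates(glusterfs_dir):
    """Yield non-empty regular files under the .glusterfs gfid fan-out
    whose link count is 1 (no remaining hard link on the brick) and
    which are not DHT link files. Report only; deletion is left to the
    operator after manual inspection."""
    for root, dirs, files in os.walk(glusterfs_dir):
        # descend only into the two-hex-character gfid fan-out dirs,
        # skipping bookkeeping subtrees such as indices/ and landfill/
        dirs[:] = [d for d in dirs if len(d) == 2]
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if (stat.S_ISREG(st.st_mode) and st.st_nlink == 1
                    and st.st_size > 0 and not is_dht_linkfile(path)):
                yield path
```

Piping the result into `rm` would reproduce Patrick's `find` one-liner, but with the extra xattr guard he asked about.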
>>>>>>>>>
>>>>>>>>>On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick <patrick.glomski at corvidtec.com> wrote:
>>>>>>>>>>Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl has been running on one of the bricks on gfs02b for 25 min or so and users cannot access the volume.
>>>>>>>>>>
>>>>>>>>>>I re-listed the xattrop directories as well as a 'top' entry and heal statistics. Then I restarted the gluster services on gfs02a.
>>>>>>>>>>
>>>>>>>>>>=================== top ===================
>>>>>>>>>>  PID USER  PR NI VIRT  RES  SHR  S %CPU   %MEM TIME+     COMMAND
>>>>>>>>>> 8969 root  20  0 2815m 204m 3588 S 1181.0 0.6  591:06.93 glusterfsd
>>>>>>>>>>
>>>>>>>>>>=================== xattrop ===================
>>>>>>>>>>/data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-41f19453-91e4-437c-afa9-3b25614de210
>>>>>>>>>>xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>>>>>>>>>
>>>>>>>>>>/data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>>>>>>>>>>xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>>>>>>>>>
>>>>>>>>>>/data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>>>>>>>>>>xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>>>>>>>>>>xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>>>>>>>>>>xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>>>>>>>>>
>>>>>>>>>>/data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>>>>>>>>>>xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>>>>>>>>>
>>>>>>>>>>/data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>>>>>>>>>
>>>>>>>>>>/data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>>>>>>>>>
>>>>>>>>>>/data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>8034bc06-92cd-4fa5-8aaf-09039e79d2c8
>>>>>>>>>>c9ce22ed-6d8b-471b-a111-b39e57f0b512
>>>>>>>>>>94fa1d60-45ad-4341-b69c-315936b51e8d
>>>>>>>>>>xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>>>>>>>>>
>>>>>>>>>>/data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>>>>>>>>xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>>>>>>>>>
>>>>>>>>>>=================== heal stats ===================
>>>>>>>>>>
>>>>>>>>>>homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:45 2016
>>>>>>>>>>homegfs [b0-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:36:45 2016
>>>>>>>>>>homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:19 2016
>>>>>>>>>>homegfs [b1-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:36:19 2016
>>>>>>>>>>homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>>>>>>>>>
>>>>>>>>>>homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:48 2016
>>>>>>>>>>homegfs [b2-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:36:48 2016
>>>>>>>>>>homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:47 2016
>>>>>>>>>>homegfs [b3-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:36:47 2016
>>>>>>>>>>homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:06 2016
>>>>>>>>>>homegfs [b4-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:36:06 2016
>>>>>>>>>>homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:13:40 2016
>>>>>>>>>>homegfs [b5-gfsib02b] : *** Crawl is in progress ***
>>>>>>>>>>homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:58 2016
>>>>>>>>>>homegfs [b6-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:36:58 2016
>>>>>>>>>>homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:36:50 2016
>>>>>>>>>>homegfs [b7-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:36:50 2016
>>>>>>>>>>homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>=======================================================================================
>>>>>>>>>>I waited a few minutes for the heals to finish and ran the heal statistics and info again. One file is in split-brain. Aside from the split-brain, the load on all systems is down now and they are behaving normally. glustershd.log is attached. What is going on???
>>>>>>>>>>
>>>>>>>>>>Thu Jan 21 12:53:50 EST 2016
>>>>>>>>>>
>>>>>>>>>>=================== homegfs ===================
>>>>>>>>>>
>>>>>>>>>>homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:02 2016
>>>>>>>>>>homegfs [b0-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:53:02 2016
>>>>>>>>>>homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:38 2016
>>>>>>>>>>homegfs [b1-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:53:38 2016
>>>>>>>>>>homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>>>>>>>>>
>>>>>>>>>>homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b2-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b3-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:33 2016
>>>>>>>>>>homegfs [b4-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:53:33 2016
>>>>>>>>>>homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b4-gfsib02a] : No. of heal failed entries : 1
>>>>>>>>>>
>>>>>>>>>>homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:14 2016
>>>>>>>>>>homegfs [b5-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:53:15 2016
>>>>>>>>>>homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b5-gfsib02b] : No. of heal failed entries : 3
>>>>>>>>>>
>>>>>>>>>>homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b6-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>>>>>>>>>homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:09 2016
>>>>>>>>>>homegfs [b7-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:53:09 2016
>>>>>>>>>>homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of entries healed      : 0
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>>>>>>>>homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>>>>>>>>>
>>>>>>>>>>*** gluster bug in 'gluster volume heal homegfs statistics' ***
>>>>>>>>>>*** Use 'gluster volume heal homegfs info' until bug is fixed ***
>>>>>>>>>>
>>>>>>>>>>Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>>>>>>>>/users/bangell/.gconfd - Is in split-brain
>>>>>>>>>>
>>>>>>>>>>Number of entries: 1
>>>>>>>>>>
>>>>>>>>>>Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>>>>>>>>/users/bangell/.gconfd - Is in split-brain
>>>>>>>>>>
>>>>>>>>>>/users/bangell/.gconfd/saved_state
>>>>>>>>>>Number of entries: 2
>>>>>>>>>>
>>>>>>>>>>Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>>>>>>>>Number of entries: 0
>>>>>>>>>>
>>>>>>>>>>On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
>>>>>>>>>>>>I should mention that the problem is not currently occurring and there are no heals
>>>>>>>>>>>>(output appended). By restarting the gluster services, we can stop the crawl, which lowers the load for a while. Subsequent crawls seem to finish properly. For what it's worth, files/folders that show up in the 'volume info' output during a hung crawl don't seem to be anything out of the ordinary.
>>>>>>>>>>>>
>>>>>>>>>>>>Over the past four days, the typical time before the problem recurs after suppressing it in this manner is an hour. Last night when we reached out to you was the last time it happened, and the load has been low since (a relief). David believes that recursively listing the files (ls -alR or similar) from a client mount can force the issue to happen, but obviously I'd rather not unless we have some precise thing we're looking for. Let me know if you'd like me to attempt to drive the system unstable like that and what I should look for. As it's a production system, I'd rather not leave it in this state for long.
>>>>>>>>>>>
>>>>>>>>>>>Will it be possible to send the glustershd and mount logs of the past 4 days?
>>>>>>>>>>>I would like to see if this is because of directory self-heal going wild (Ravi is working on a throttling feature for 3.8, which will allow us to put brakes on self-heal traffic).
>>>>>>>>>>>
>>>>>>>>>>>Pranith
>>>>>>>>>>>>
>>>>>>>>>>>>[root at gfs01a xattrop]# gluster volume heal homegfs info
>>>>>>>>>>>>Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>>>>>>>>>>Number of entries: 0
>>>>>>>>>>>>
>>>>>>>>>>>>On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
>>>>>>>>>>>>>>Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd process jumps to the number of processor cores available (800% or 1200%, depending on the pair of nodes involved) and the load average on the machine goes very high (~20). The volume's heal statistics output shows that it is crawling one of the bricks and trying to heal, but this crawl hangs and never seems to finish.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>The number of files in the xattrop directory varies over time, so I ran a wc -l as you requested periodically for some time and then started including a datestamped list of the files that were in the xattrop directory on each brick to see which were persistent. All bricks had files in the xattrop folder, so all results are attached.
>>>>>>>>>>>>>Thanks, this info is helpful. I don't see a lot of files. Could you give the output of "gluster volume heal <volname> info"? Is there any directory in there which is LARGE?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Pranith
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Please let me know if there is anything else I can provide.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Patrick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>>>>>>>>>>>>hey,
>>>>>>>>>>>>>>>     Which process is consuming so much cpu? I went through the logs you gave me. I see that the following files are in gfid mismatch state:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>><066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>>>>>>>>>>>>>><1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>>>>>>>>>>>>>><ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Could you give me the output of "ls <brick-path>/indices/xattrop | wc -l" on all the bricks which are acting this way? This will tell us the number of pending self-heals on the system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Pranith
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>On 01/20/2016 09:26 PM, David Robinson wrote:
>>>>>>>>>>>>>>>>resending with parsed logs...
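Pranith's per-brick `ls ... | wc -l` check can be scripted across all bricks at once. A sketch follows; the function name is invented, the `.glusterfs/indices/xattrop` path matches the listings quoted in this thread, and the split between `xattrop-*` base entries and bare-gfid entries mirrors the two name patterns visible in those listings. Index semantics vary by Gluster version, so treat the numbers only as a rough pending-heal signal.

```python
import os

def xattrop_counts(brick_paths):
    """Count entries in each brick's indices/xattrop directory, the same
    signal as `ls <brick-path>/indices/xattrop | wc -l`. Entries named
    'xattrop-<gfid>' are tallied separately from bare-gfid entries."""
    counts = {}
    for brick in brick_paths:
        xattrop = os.path.join(brick, ".glusterfs", "indices", "xattrop")
        try:
            entries = os.listdir(xattrop)
        except FileNotFoundError:
            counts[brick] = None  # not a brick root, or path typo
            continue
        base = sum(1 for e in entries if e.startswith("xattrop-"))
        counts[brick] = {"total": len(entries), "base": base,
                         "gfid": len(entries) - base}
    return counts
```

Run it on each storage node with that node's brick roots (e.g. `/data/brick01a/homegfs`) and compare totals over time to see whether the pending set is shrinking.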
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>I am having issues with 3.6.6 where the load will spike up to 800% for one of the glusterfsd processes and the users can no longer access the system. If I reboot the node, the heal will finish normally after a few minutes and the system will be responsive, but a few hours later the issue will start again. It looks like it is hanging in a heal and spinning up the load on one of the bricks. The heal gets stuck, says it is crawling, and never returns. After a few minutes of the heal saying it is crawling, the load spikes up and the mounts become unresponsive.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>Any suggestions on how to fix this? It has us stopped cold, as the users can no longer access the systems when the load spikes... Logs attached.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>System setup info is:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>[root at gfs01a ~]# gluster volume info homegfs
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>Volume Name: homegfs
>>>>>>>>>>>>>>>>>>Type: Distributed-Replicate
>>>>>>>>>>>>>>>>>>Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>>>>>>>>>>>>>Status: Started
>>>>>>>>>>>>>>>>>>Number of Bricks: 4 x 2 = 8
>>>>>>>>>>>>>>>>>>Transport-type: tcp
>>>>>>>>>>>>>>>>>>Bricks:
>>>>>>>>>>>>>>>>>>Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>>>>>>>>>>>>Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>>>>>>>>>>>>Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>>>>>>>>>>>>Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>>>>>>>>>>>>Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>>>>>>>>>>>>Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>>>>>>>>>>>>Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>>>>>>>>>>>>Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>>>>>>>>>>>>Options Reconfigured:
>>>>>>>>>>>>>>>>>>performance.io-thread-count: 32
>>>>>>>>>>>>>>>>>>performance.cache-size: 128MB
>>>>>>>>>>>>>>>>>>performance.write-behind-window-size: 128MB
>>>>>>>>>>>>>>>>>>server.allow-insecure: on
>>>>>>>>>>>>>>>>>>network.ping-timeout: 42
>>>>>>>>>>>>>>>>>>storage.owner-gid: 100
>>>>>>>>>>>>>>>>>>geo-replication.indexing: off
>>>>>>>>>>>>>>>>>>geo-replication.ignore-pid-check: on
>>>>>>>>>>>>>>>>>>changelog.changelog: off
>>>>>>>>>>>>>>>>>>changelog.fsync-interval: 3
>>>>>>>>>>>>>>>>>>changelog.rollover-time: 15
>>>>>>>>>>>>>>>>>>server.manage-gids: on
>>>>>>>>>>>>>>>>>>diagnostics.client-log-level: WARNING
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>[root at gfs01a ~]# rpm -qa | grep gluster
>>>>>>>>>>>>>>>>>>gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>>>>>>>>>>>>>>glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-api-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>>>glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>>>Gluster-devel mailing list
>>>>>>>>>>>>>>>>Gluster-devel at gluster.org
>>>>>>>>>>>>>>>>http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>>Gluster-users mailing list
>>>>>>>>>>>>>>>Gluster-users at gluster.org
>>>>>>>>>>>>>>>http://www.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160125/4018f0a2/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: data-brick02a-homegfs.4066.dump.1453742225.gz
Type: application/x-gzip
Size: 1138050 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160125/4018f0a2/attachment-0002.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: data-brick01a-homegfs.4061.dump.1453742224.gz
Type: application/x-gzip
Size: 640151 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160125/4018f0a2/attachment-0003.gz>
Pranith Kumar Karampuri
2016-Jan-28 10:10 UTC
[Gluster-users] [Gluster-devel] heal hanging
On 01/25/2016 11:10 PM, David Robinson wrote:
> It is doing it again... statedump from gfs02a is attached...
David, I see a lot of traffic from [f]inodelks:

15:09:00 :) ? grep wind_from data-brick02a-homegfs.4066.dump.1453742225 | sort | uniq -c
     11 unwind_from=default_finodelk_cbk
     11 unwind_from=io_stats_finodelk_cbk
     11 unwind_from=pl_common_inodelk
   1133 wind_from=default_finodelk_resume
      1 wind_from=default_inodelk_resume
     75 wind_from=index_getxattr
      6 wind_from=io_stats_entrylk
  12776 wind_from=io_stats_finodelk
     15 wind_from=io_stats_flush
     75 wind_from=io_stats_getxattr
      4 wind_from=io_stats_inodelk
      4 wind_from=io_stats_lk
      4 wind_from=io_stats_setattr
     75 wind_from=marker_getxattr
      4 wind_from=marker_setattr
     75 wind_from=quota_getxattr
      6 wind_from=server_entrylk_resume
  12776 wind_from=server_finodelk_resume  <<--------------
     15 wind_from=server_flush_resume
     75 wind_from=server_getxattr_resume
      4 wind_from=server_inodelk_resume
      4 wind_from=server_lk_resume
      4 wind_from=server_setattr_resume

But very few active locks:

pk1 at localhost - ~/Downloads
15:09:07 :) ?
grep ACTIVE data-brick02a-homegfs.4066.dump.1453742225
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 11678, owner=b42fff03ce7f0000, client=0x13d2cd0, connection-id=corvidpost3.corvidtec.com-52656-2016/01/22-16:40:31:459920-homegfs-client-6-0-1, granted at 2016-01-25 17:16:06
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 15759, owner=b8ca8c0100000000, client=0x189e470, connection-id=corvidpost4.corvidtec.com-17718-2016/01/22-16:40:31:221380-homegfs-client-6-0-1, granted at 2016-01-25 17:12:52
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 7103, owner=0cf31a98f87f0000, client=0x2201d60, connection-id=zlv-bangell-4812-2016/01/25-13:45:52:170157-homegfs-client-6-0-0, granted at 2016-01-25 17:09:56
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 55764, owner=882dbea1417f0000, client=0x17fc940, connection-id=corvidpost.corvidtec.com-35961-2016/01/22-16:40:31:88946-homegfs-client-6-0-1, granted at 2016-01-25 17:06:12
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=9223372036854775806, len=0, pid = 21129, owner=3cc068a1e07f0000, client=0x1495040, connection-id=corvidpost2.corvidtec.com-43400-2016/01/22-16:40:31:248771-homegfs-client-6-0-1, granted at 2016-01-25 17:15:53

One more odd thing I found is the following:
[2016-01-15 14:03:06.910687] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-homegfs-client-2: server 10.200.70.1:49153 has not responded in the last 10 seconds, disconnecting.
[2016-01-15 14:03:06.910886] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x2b74c289a580] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x2b74c2b27787] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x2b74c2b2789e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x2b74c2b27951] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x2b74c2b27f1f] ))))) 0-homegfs-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-01-15 10:30:09.487422 (xid=0x11ed3f) FINODELK is called at 2016-01-15 10:30:09.487422 but the response still didn't come till 14:03:06. That is almost 3.5 hours!! Something really bad related to locks is happening. Did you guys patch the recent memory corruption bug which only affects workloads with more than 128 clients? http://review.gluster.org/13241 Pranith> ------ Original Message ------ > From: "Pranith Kumar Karampuri" <pkarampu at redhat.com > <mailto:pkarampu at redhat.com>> > To: "Glomski, Patrick" <patrick.glomski at corvidtec.com > <mailto:patrick.glomski at corvidtec.com>> > Cc: "David Robinson" <drobinson at corvidtec.com > <mailto:drobinson at corvidtec.com>>; "gluster-users at gluster.org" > <gluster-users at gluster.org <mailto:gluster-users at gluster.org>>; > "Gluster Devel" <gluster-devel at gluster.org > <mailto:gluster-devel at gluster.org>> > Sent: 1/24/2016 9:27:02 PM > Subject: Re: [Gluster-users] [Gluster-devel] heal hanging >> It seems like there is a lot of finodelk/inodelk traffic. I wonder >> why that is. I think the next steps is to collect statedump of the >> brick which is taking lot of CPU, using "gluster volume statedump >> <volname>" >> >> Pranith >> On 01/22/2016 08:36 AM, Glomski, Patrick wrote: >>> Pranith, attached are stack traces collected every second for 20 >>> seconds from the high-%cpu glusterfsd process. 
>>> >>> Patrick >>> >>> On Thu, Jan 21, 2016 at 9:46 PM, Glomski, Patrick >>> <patrick.glomski at corvidtec.com >>> <mailto:patrick.glomski at corvidtec.com>> wrote: >>> >>> Last entry for get_real_filename on any of the bricks was when >>> we turned off the samba gfapi vfs plugin earlier today: >>> >>> /var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21 >>> 15:13:00.008239] E [server-rpc-fops.c:768:server_getxattr_cbk] >>> 0-homegfs-server: 105: GETXATTR /wks_backup >>> (40e582d6-b0c7-4099-ba88-9168a3c32ca6) >>> (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied) >>> >>> We'll get back to you with those traces when %cpu spikes again. >>> As with most sporadic problems, as soon as you want something >>> out of it, the issue becomes harder to reproduce. >>> >>> >>> On Thu, Jan 21, 2016 at 9:21 PM, Pranith Kumar Karampuri >>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote: >>> >>> >>> >>> On 01/22/2016 07:25 AM, Glomski, Patrick wrote: >>>> Unfortunately, all samba mounts to the gluster volume >>>> through the gfapi vfs plugin have been disabled for the >>>> last 6 hours or so and frequency of %cpu spikes is >>>> increased. We had switched to sharing a fuse mount through >>>> samba, but I just disabled that as well. There are no samba >>>> shares of this volume now. The spikes now happen every >>>> thirty minutes or so. We've resorted to just rebooting the >>>> machine with high load for the present. >>> >>> Could you see if the logs of following type are not at all >>> coming? >>> [2016-01-21 15:13:00.005736] E >>> [server-rpc-fops.c:768:server_getxattr_cbk] >>> 0-homegfs-server: 110: GETXATTR /wks_backup >>> (40e582d6-b0c7-4099-ba88-9168a3c >>> 32ca6) (glusterfs.get_real_filename:desktop.ini) ==> >>> (Permission denied) >>> >>> These are operations that failed. Operations that succeed >>> are the ones that will scan the directory. But I don't have >>> a way to find them other than using tcpdumps. 
>>>
>>> At the moment I have 2 theories:
>>> 1) these get_real_filename calls
>>> 2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve] 0-gid-cache: getpwuid_r(494) failed
>>>
>>> "Yessir they are. Normally, sssd would look to the local cache file in /var/lib/sss/db/ first, to get any group or userid information, then go out to the domain controller. I put the options that we are using on our GFS volumes below. Thanks for your help.
>>>
>>> We had been running sssd with sssd_nss and sssd_be sub-processes on these systems for a long time, under the GFS 3.5.2 code, and not run into the problem that David described with the high cpu usage on sssd_nss."
>>>
>>> That was Tom Young's email from 1.5 years back when we debugged it. But the process which was consuming a lot of cpu then was sssd_nss, so I am not sure if it is the same issue. Let us debug to make sure '1)' doesn't happen. The gstack traces I asked for should also help.
>>>
>>> Pranith
>>>>
>>>> On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>
>>>> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
>>>>> We use the samba glusterfs virtual filesystem (the current version provided on download.gluster.org), but no windows clients connecting directly.
>>>>
>>>> Hmm.. Is there a way to disable using this and check if the CPU% still increases? What a getxattr of "glusterfs.get_real_filename <filename>" does is scan the entire directory looking for strcasecmp(<filename>, <scanned-filename>). If anything matches, it returns the <scanned-filename>. But the problem is that the scan is costly, so I wonder if this is the reason for the CPU spikes.
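To illustrate why that getxattr is expensive: conceptually it is a linear, case-insensitive scan over every entry in the directory. A rough Python sketch of the idea (a conceptual model only, not the actual brick-side implementation):

```python
import os

def get_real_filename(directory, wanted):
    """Return the on-disk name matching `wanted` case-insensitively,
    or None. Conceptual model of glusterfs.get_real_filename: every
    lookup is O(number of directory entries)."""
    target = wanted.lower()
    for entry in os.scandir(directory):   # full directory scan per call
        if entry.name.lower() == target:  # analogous to strcasecmp()
            return entry.name
    return None
```

So a Windows/Samba client asking for `DESKTOP.INI` gets back `desktop.ini` only after the whole directory has been walked, and directories with many entries turn each such request into a burst of CPU work.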
>>>> >>>> Pranith >>>> >>>>> >>>>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar >>>>> Karampuri <pkarampu at redhat.com >>>>> <mailto:pkarampu at redhat.com>> wrote: >>>>> >>>>> Do you have any windows clients? I see a lot of >>>>> getxattr calls for "glusterfs.get_real_filename" >>>>> which lead to full readdirs of the directories on >>>>> the brick. >>>>> >>>>> Pranith >>>>> >>>>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote: >>>>>> Pranith, could this kind of behavior be >>>>>> self-inflicted by us deleting files directly from >>>>>> the bricks? We have done that in the past to >>>>>> clean up an issues where gluster wouldn't allow >>>>>> us to delete from the mount. >>>>>> >>>>>> If so, is it feasible to clean them up by running >>>>>> a search on the .glusterfs directories directly >>>>>> and removing files with a reference count of 1 >>>>>> that are non-zero size (or directly checking the >>>>>> xattrs to be sure that it's not a DHT link). >>>>>> >>>>>> find /data/brick01a/homegfs/.glusterfs -type f >>>>>> -not -empty -links -2 -exec rm -f "{}" \; >>>>>> >>>>>> Is there anything I'm inherently missing with >>>>>> that approach that will further corrupt the system? >>>>>> >>>>>> >>>>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick >>>>>> <patrick.glomski at corvidtec.com >>>>>> <mailto:patrick.glomski at corvidtec.com>> wrote: >>>>>> >>>>>> Load spiked again: ~1200%cpu on gfs02a for >>>>>> glusterfsd. Crawl has been running on one of >>>>>> the bricks on gfs02b for 25 min or so and >>>>>> users cannot access the volume. >>>>>> >>>>>> I re-listed the xattrop directories as well >>>>>> as a 'top' entry and heal statistics. Then I >>>>>> restarted the gluster services on gfs02a. 
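One way to make the proposed .glusterfs cleanup less risky than a bare `find ... -exec rm` is to check each candidate explicitly before deleting. A hedged sketch (Python; `trusted.glusterfs.dht.linkto` is the xattr DHT uses on link files, but verify against your gluster version on a scratch copy before removing anything):

```python
import os
import stat

DHT_LINKTO = "trusted.glusterfs.dht.linkto"

def is_orphan_candidate(path):
    """True if a .glusterfs gfid file looks like a leftover from
    deleting files directly on the brick: a regular, non-empty file
    with only one link (no brick-side name) that is not a DHT link."""
    st = os.lstat(path)
    if not stat.S_ISREG(st.st_mode):
        return False
    if st.st_nlink != 1 or st.st_size == 0:
        return False
    try:
        # DHT link files carry this xattr; never treat them as orphans.
        if DHT_LINKTO in os.listxattr(path):
            return False
    except OSError:
        return False  # can't verify the xattrs, so leave it alone
    return True
```

This mirrors the `-type f -not -empty -links -2` logic above but adds the xattr check the question raises; it only classifies, so removal stays a separate, deliberate step.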
>>>>>>
>>>>>> =================== top ===================
>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>>> 8969 root 20 0 2815m 204m 3588 S 1181.0 0.6 591:06.93 glusterfsd
>>>>>>
>>>>>> =================== xattrop ===================
>>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
>>>>>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>>>>>
>>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>>>>>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>>>>>
>>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>>>>>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>>>>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>>>>>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>>>>>
>>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>>>>>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>>>>>
>>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>>>>>
>>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>>>>>
>>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
>>>>>> c9ce22ed-6d8b-471b-a111-b39e57f0b512
>>>>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
>>>>>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>>>>>
>>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>>>>>
>>>>>> =================== heal stats ===================
>>>>>>
>>>>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:45 2016
>>>>>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:36:45 2016
>>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>>>> homegfs [b0-gfsib01a] : No.
of entries healed : 0 >>>>>> homegfs [b0-gfsib01a] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b0-gfsib01a] : No. of heal failed >>>>>> entries : 0 >>>>>> >>>>>> homegfs [b1-gfsib01b] : Starting time of >>>>>> crawl : Thu Jan 21 12:36:19 2016 >>>>>> homegfs [b1-gfsib01b] : Ending time of crawl >>>>>> : Thu Jan 21 12:36:19 2016 >>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX >>>>>> homegfs [b1-gfsib01b] : No. of entries healed : 0 >>>>>> homegfs [b1-gfsib01b] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b1-gfsib01b] : No. of heal failed >>>>>> entries : 1 >>>>>> >>>>>> homegfs [b2-gfsib01a] : Starting time of >>>>>> crawl : Thu Jan 21 12:36:48 2016 >>>>>> homegfs [b2-gfsib01a] : Ending time of crawl >>>>>> : Thu Jan 21 12:36:48 2016 >>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX >>>>>> homegfs [b2-gfsib01a] : No. of entries healed : 0 >>>>>> homegfs [b2-gfsib01a] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b2-gfsib01a] : No. of heal failed >>>>>> entries : 0 >>>>>> >>>>>> homegfs [b3-gfsib01b] : Starting time of >>>>>> crawl : Thu Jan 21 12:36:47 2016 >>>>>> homegfs [b3-gfsib01b] : Ending time of crawl >>>>>> : Thu Jan 21 12:36:47 2016 >>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX >>>>>> homegfs [b3-gfsib01b] : No. of entries healed : 0 >>>>>> homegfs [b3-gfsib01b] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b3-gfsib01b] : No. of heal failed >>>>>> entries : 0 >>>>>> >>>>>> homegfs [b4-gfsib02a] : Starting time of >>>>>> crawl : Thu Jan 21 12:36:06 2016 >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl >>>>>> : Thu Jan 21 12:36:06 2016 >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX >>>>>> homegfs [b4-gfsib02a] : No. of entries healed : 0 >>>>>> homegfs [b4-gfsib02a] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b4-gfsib02a] : No. 
of heal failed entries : 0
>>>>>>
>>>>>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:13:40 2016
>>>>>> homegfs [b5-gfsib02b] : *** Crawl is in progress ***
>>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>>>> homegfs [b5-gfsib02b] : No. of entries healed : 0
>>>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>>>> homegfs [b5-gfsib02b] : No. of heal failed entries : 0
>>>>>>
>>>>>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:58 2016
>>>>>> homegfs [b6-gfsib02a] : Ending time of crawl : Thu Jan 21 12:36:58 2016
>>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>>>> homegfs [b6-gfsib02a] : No. of entries healed : 0
>>>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>>>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>>>>>
>>>>>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:36:50 2016
>>>>>> homegfs [b7-gfsib02b] : Ending time of crawl : Thu Jan 21 12:36:50 2016
>>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>>>> homegfs [b7-gfsib02b] : No. of entries healed : 0
>>>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>>>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>>>>>
>>>>>> =======================================================================================
>>>>>> I waited a few minutes for the heals to finish and ran the heal statistics and info again. One file is in split-brain. Aside from the split-brain, the load on all systems is down now and they are behaving normally. glustershd.log is attached. What is going on???
>>>>>>
>>>>>> Thu Jan 21 12:53:50 EST 2016
>>>>>>
>>>>>> =================== homegfs ===================
>>>>>>
>>>>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:02 2016
>>>>>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:02 2016
>>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>>>> homegfs [b0-gfsib01a] : No. of entries healed : 0
>>>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>>>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>>>>>
>>>>>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:38 2016
>>>>>> homegfs [b1-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:38 2016
>>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>>>> homegfs [b1-gfsib01b] : No. of entries healed : 0
>>>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>>>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>>>>>
>>>>>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>>>>> homegfs [b2-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:04 2016
>>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>>>> homegfs [b2-gfsib01a] : No. of entries healed : 0
>>>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>>>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>>>>>
>>>>>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>>>>> homegfs [b3-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:04 2016
>>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>>>> homegfs [b3-gfsib01b] : No. of entries healed : 0
>>>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>>>> homegfs [b3-gfsib01b] : No.
of heal failed >>>>>> entries : 0 >>>>>> >>>>>> homegfs [b4-gfsib02a] : Starting time of >>>>>> crawl : Thu Jan 21 12:53:33 2016 >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl >>>>>> : Thu Jan 21 12:53:33 2016 >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX >>>>>> homegfs [b4-gfsib02a] : No. of entries healed : 0 >>>>>> homegfs [b4-gfsib02a] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b4-gfsib02a] : No. of heal failed >>>>>> entries : 1 >>>>>> >>>>>> homegfs [b5-gfsib02b] : Starting time of >>>>>> crawl : Thu Jan 21 12:53:14 2016 >>>>>> homegfs [b5-gfsib02b] : Ending time of crawl >>>>>> : Thu Jan 21 12:53:15 2016 >>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX >>>>>> homegfs [b5-gfsib02b] : No. of entries healed : 0 >>>>>> homegfs [b5-gfsib02b] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b5-gfsib02b] : No. of heal failed >>>>>> entries : 3 >>>>>> >>>>>> homegfs [b6-gfsib02a] : Starting time of >>>>>> crawl : Thu Jan 21 12:53:04 2016 >>>>>> homegfs [b6-gfsib02a] : Ending time of crawl >>>>>> : Thu Jan 21 12:53:04 2016 >>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX >>>>>> homegfs [b6-gfsib02a] : No. of entries healed : 0 >>>>>> homegfs [b6-gfsib02a] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b6-gfsib02a] : No. of heal failed >>>>>> entries : 0 >>>>>> >>>>>> homegfs [b7-gfsib02b] : Starting time of >>>>>> crawl : Thu Jan 21 12:53:09 2016 >>>>>> homegfs [b7-gfsib02b] : Ending time of crawl >>>>>> : Thu Jan 21 12:53:09 2016 >>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX >>>>>> homegfs [b7-gfsib02b] : No. of entries healed : 0 >>>>>> homegfs [b7-gfsib02b] : No. of entries in >>>>>> split-brain: 0 >>>>>> homegfs [b7-gfsib02b] : No. 
of heal failed >>>>>> entries : 0 >>>>>> >>>>>> *** gluster bug in 'gluster volume heal >>>>>> homegfs statistics' *** >>>>>> *** Use 'gluster volume heal homegfs info' >>>>>> until bug is fixed *** >>>>>> >>>>>> Brick >>>>>> gfs01a.corvidtec.com:/data/brick01a/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> Brick >>>>>> gfs01b.corvidtec.com:/data/brick01b/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> Brick >>>>>> gfs01a.corvidtec.com:/data/brick02a/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> Brick >>>>>> gfs01b.corvidtec.com:/data/brick02b/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> Brick >>>>>> gfs02a.corvidtec.com:/data/brick01a/homegfs/ >>>>>> /users/bangell/.gconfd - Is in split-brain >>>>>> >>>>>> Number of entries: 1 >>>>>> >>>>>> Brick >>>>>> gfs02b.corvidtec.com:/data/brick01b/homegfs/ >>>>>> /users/bangell/.gconfd - Is in split-brain >>>>>> >>>>>> /users/bangell/.gconfd/saved_state >>>>>> Number of entries: 2 >>>>>> >>>>>> Brick >>>>>> gfs02a.corvidtec.com:/data/brick02a/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> Brick >>>>>> gfs02b.corvidtec.com:/data/brick02b/homegfs/ >>>>>> Number of entries: 0 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith >>>>>> Kumar Karampuri <pkarampu at redhat.com >>>>>> <mailto:pkarampu at redhat.com>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 01/21/2016 09:26 PM, Glomski, Patrick >>>>>> wrote: >>>>>>> I should mention that the problem is not >>>>>>> currently occurring and there are no >>>>>>> heals (output appended). By restarting >>>>>>> the gluster services, we can stop the >>>>>>> crawl, which lowers the load for a >>>>>>> while. Subsequent crawls seem to finish >>>>>>> properly. For what it's worth, >>>>>>> files/folders that show up in the >>>>>>> 'volume info' output during a hung crawl >>>>>>> don't seem to be anything out of the >>>>>>> ordinary. 
>>>>>>> >>>>>>> Over the past four days, the typical >>>>>>> time before the problem recurs after >>>>>>> suppressing it in this manner is an >>>>>>> hour. Last night when we reached out to >>>>>>> you was the last time it happened and >>>>>>> the load has been low since (a relief). >>>>>>> David believes that recursively listing >>>>>>> the files (ls -alR or similar) from a >>>>>>> client mount can force the issue to >>>>>>> happen, but obviously I'd rather not >>>>>>> unless we have some precise thing we're >>>>>>> looking for. Let me know if you'd like >>>>>>> me to attempt to drive the system >>>>>>> unstable like that and what I should >>>>>>> look for. As it's a production system, >>>>>>> I'd rather not leave it in this state >>>>>>> for long. >>>>>> >>>>>> Will it be possible to send glustershd, >>>>>> mount logs of the past 4 days? I would >>>>>> like to see if this is because of >>>>>> directory self-heal going wild (Ravi is >>>>>> working on throttling feature for 3.8, >>>>>> which will allow to put breaks on >>>>>> self-heal traffic) >>>>>> >>>>>> Pranith >>>>>> >>>>>>> >>>>>>> [root at gfs01a xattrop]# gluster volume >>>>>>> heal homegfs info >>>>>>> Brick >>>>>>> gfs01a.corvidtec.com:/data/brick01a/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs01b.corvidtec.com:/data/brick01b/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs01a.corvidtec.com:/data/brick02a/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs01b.corvidtec.com:/data/brick02b/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs02a.corvidtec.com:/data/brick01a/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs02b.corvidtec.com:/data/brick01b/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs02a.corvidtec.com:/data/brick02a/homegfs/ >>>>>>> Number of entries: 0 >>>>>>> >>>>>>> Brick >>>>>>> gfs02b.corvidtec.com:/data/brick02b/homegfs/ >>>>>>> Number of entries: 0 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Jan 21, 2016 at 10:40 AM, >>>>>>> Pranith Kumar Karampuri >>>>>>> <pkarampu at redhat.com >>>>>>> <mailto:pkarampu at redhat.com>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 01/21/2016 08:25 PM, Glomski, >>>>>>> Patrick wrote: >>>>>>>> Hello, Pranith. The typical >>>>>>>> behavior is that the %cpu on a >>>>>>>> glusterfsd process jumps to number >>>>>>>> of processor cores available (800% >>>>>>>> or 1200%, depending on the pair of >>>>>>>> nodes involved) and the load >>>>>>>> average on the machine goes very >>>>>>>> high (~20). The volume's heal >>>>>>>> statistics output shows that it is >>>>>>>> crawling one of the bricks and >>>>>>>> trying to heal, but this crawl >>>>>>>> hangs and never seems to finish. >>>>>>>> >>>>>>>> The number of files in the xattrop >>>>>>>> directory varies over time, so I >>>>>>>> ran a wc -l as you requested >>>>>>>> periodically for some time and then >>>>>>>> started including a datestamped >>>>>>>> list of the files that were in the >>>>>>>> xattrops directory on each brick to >>>>>>>> see which were persistent. All >>>>>>>> bricks had files in the xattrop >>>>>>>> folder, so all results are attached. >>>>>>> Thanks this info is helpful. I don't >>>>>>> see a lot of files. Could you give >>>>>>> output of "gluster volume heal >>>>>>> <volname> info"? Is there any >>>>>>> directory in there which is LARGE? >>>>>>> >>>>>>> Pranith >>>>>>> >>>>>>>> >>>>>>>> Please let me know if there is >>>>>>>> anything else I can provide. >>>>>>>> >>>>>>>> Patrick >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jan 21, 2016 at 12:01 AM, >>>>>>>> Pranith Kumar Karampuri >>>>>>>> <pkarampu at redhat.com >>>>>>>> <mailto:pkarampu at redhat.com>> wrote: >>>>>>>> >>>>>>>> hey, >>>>>>>> Which process is >>>>>>>> consuming so much cpu? I went >>>>>>>> through the logs you gave me. 
I >>>>>>>> see that the following files >>>>>>>> are in gfid mismatch state: >>>>>>>> >>>>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>, >>>>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>, >>>>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>, >>>>>>>> >>>>>>>> Could you give me the output of >>>>>>>> "ls >>>>>>>> <brick-path>/indices/xattrop | >>>>>>>> wc -l" output on all the bricks >>>>>>>> which are acting this way? This >>>>>>>> will tell us the number of >>>>>>>> pending self-heals on the system. >>>>>>>> >>>>>>>> Pranith >>>>>>>> >>>>>>>> >>>>>>>> On 01/20/2016 09:26 PM, David >>>>>>>> Robinson wrote: >>>>>>>>> resending with parsed logs... >>>>>>>>>>> I am having issues with >>>>>>>>>>> 3.6.6 where the load will >>>>>>>>>>> spike up to 800% for one of >>>>>>>>>>> the glusterfsd processes and >>>>>>>>>>> the users can no longer >>>>>>>>>>> access the system. If I >>>>>>>>>>> reboot the node, the heal >>>>>>>>>>> will finish normally after a >>>>>>>>>>> few minutes and the system >>>>>>>>>>> will be responsive, but a >>>>>>>>>>> few hours later the issue >>>>>>>>>>> will start again. It look >>>>>>>>>>> like it is hanging in a heal >>>>>>>>>>> and spinning up the load on >>>>>>>>>>> one of the bricks. The heal >>>>>>>>>>> gets stuck and says it is >>>>>>>>>>> crawling and never returns. >>>>>>>>>>> After a few minutes of the >>>>>>>>>>> heal saying it is crawling, >>>>>>>>>>> the load spikes up and the >>>>>>>>>>> mounts become unresponsive. >>>>>>>>>>> Any suggestions on how to >>>>>>>>>>> fix this? It has us stopped >>>>>>>>>>> cold as the user can no >>>>>>>>>>> longer access the systems >>>>>>>>>>> when the load spikes... Logs >>>>>>>>>>> attached. 
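The per-brick xattrop count Pranith asks for (`ls <brick-path>/indices/xattrop | wc -l`) can also be gathered in one pass. A small sketch (Python; the brick root paths would be the ones from this volume's `gluster volume info` output):

```python
import os

def pending_heals(brick_paths):
    """Count entries in each brick's .glusterfs/indices/xattrop
    directory -- the same number `ls ... | wc -l` reports, i.e. the
    pending self-heals on that brick."""
    counts = {}
    for brick in brick_paths:
        xattrop = os.path.join(brick, ".glusterfs", "indices", "xattrop")
        try:
            counts[brick] = len(os.listdir(xattrop))
        except FileNotFoundError:
            counts[brick] = None  # not a brick root, or index dir missing
    return counts
```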
>>>>>>>>>>> System setup info is: >>>>>>>>>>> [root at gfs01a ~]# gluster >>>>>>>>>>> volume info homegfs >>>>>>>>>>> >>>>>>>>>>> Volume Name: homegfs >>>>>>>>>>> Type: Distributed-Replicate >>>>>>>>>>> Volume ID: >>>>>>>>>>> 1e32672a-f1b7-4b58-ba94-58c085e59071 >>>>>>>>>>> Status: Started >>>>>>>>>>> Number of Bricks: 4 x 2 = 8 >>>>>>>>>>> Transport-type: tcp >>>>>>>>>>> Bricks: >>>>>>>>>>> Brick1: >>>>>>>>>>> gfsib01a.corvidtec.com:/data/brick01a/homegfs >>>>>>>>>>> Brick2: >>>>>>>>>>> gfsib01b.corvidtec.com:/data/brick01b/homegfs >>>>>>>>>>> Brick3: >>>>>>>>>>> gfsib01a.corvidtec.com:/data/brick02a/homegfs >>>>>>>>>>> Brick4: >>>>>>>>>>> gfsib01b.corvidtec.com:/data/brick02b/homegfs >>>>>>>>>>> Brick5: >>>>>>>>>>> gfsib02a.corvidtec.com:/data/brick01a/homegfs >>>>>>>>>>> Brick6: >>>>>>>>>>> gfsib02b.corvidtec.com:/data/brick01b/homegfs >>>>>>>>>>> Brick7: >>>>>>>>>>> gfsib02a.corvidtec.com:/data/brick02a/homegfs >>>>>>>>>>> Brick8: >>>>>>>>>>> gfsib02b.corvidtec.com:/data/brick02b/homegfs >>>>>>>>>>> Options Reconfigured: >>>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>>> performance.cache-size: 128MB >>>>>>>>>>> performance.write-behind-window-size: >>>>>>>>>>> 128MB >>>>>>>>>>> server.allow-insecure: on >>>>>>>>>>> network.ping-timeout: 42 >>>>>>>>>>> storage.owner-gid: 100 >>>>>>>>>>> geo-replication.indexing: off >>>>>>>>>>> geo-replication.ignore-pid-check: >>>>>>>>>>> on >>>>>>>>>>> changelog.changelog: off >>>>>>>>>>> changelog.fsync-interval: 3 >>>>>>>>>>> changelog.rollover-time: 15 >>>>>>>>>>> server.manage-gids: on >>>>>>>>>>> diagnostics.client-log-level: WARNING >>>>>>>>>>> [root at gfs01a ~]# rpm -qa | >>>>>>>>>>> grep gluster >>>>>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch >>>>>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64 >>>>>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64 >>>>>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64 >>>>>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64 >>>>>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64 >>>>>>>>>>> 
glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Gluster-devel mailing list
>>>>>>>>> Gluster-devel at gluster.org
>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gluster-users mailing list
>>>>>>>> Gluster-users at gluster.org
>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users