Pranith Kumar Karampuri
2016-Jan-22 02:21 UTC
[Gluster-users] [Gluster-devel] heal hanging
On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
> Unfortunately, all samba mounts to the gluster volume through the
> gfapi vfs plugin have been disabled for the last 6 hours or so, and
> the frequency of %cpu spikes has increased. We had switched to
> sharing a fuse mount through samba, but I just disabled that as
> well. There are no samba shares of this volume now. The spikes now
> happen every thirty minutes or so. We've resorted to just rebooting
> the machine with high load for the present.

Could you check whether logs of the following type have stopped appearing entirely?

[2016-01-21 15:13:00.005736] E [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 110: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)

These are the operations that failed. The operations that succeed are the ones that will scan the directory, and I don't have a way to find those other than using tcpdump.

At the moment I have two theories:
1) these get_real_filename calls
2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve] 0-gid-cache: getpwuid_r(494) failed

"Yessir they are. Normally, sssd would look to the local cache file in /var/lib/sss/db/ first to get any group or userid information, then go out to the domain controller. I put the options that we are using on our GFS volumes below. Thanks for your help.

We had been running sssd with sssd_nss and sssd_be sub-processes on these systems for a long time, under the GFS 3.5.2 code, and had not run into the problem that David described with the high cpu usage on sssd_nss."

That was Tom Young's email from 1.5 years back, when we debugged this. But the process that was consuming a lot of CPU then was sssd_nss, so I am not sure it is the same issue. Let us debug to confirm that '1)' is not happening. The gstack traces I asked for should also help.
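
Something along these lines should be enough to check both theories and to capture the stacks. Treat it as a rough sketch; the brick-log location is an assumption based on the log line you pasted earlier:

# 1) Are get_real_filename requests still reaching the bricks? Only the
#    failed ones are logged, so a zero here is not conclusive on its own.
grep -c 'glusterfs.get_real_filename' /var/log/glusterfs/bricks/*.log

# 2) Does uid 494 resolve at all? getpwuid_r(494) failing suggests sssd
#    is not answering for it.
getent passwd 494

# Capture stack traces of every brick process while the CPU is pegged;
# run this a few times, a few seconds apart.
for pid in $(pidof glusterfsd); do gstack "$pid" > /tmp/gstack.$pid.$(date +%s); done

Pranith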
> On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>
> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
>> We use the samba glusterfs virtual filesystem (the current
>> version provided on download.gluster.org), but no windows
>> clients connecting directly.
>
> Hmm.. Is there a way to disable this and check if the CPU%
> still increases? What a getxattr of "glusterfs.get_real_filename
> <filename>" does is scan the entire directory looking for
> strcasecmp(<filename>, <scanned-filename>). If anything matches,
> it returns the <scanned-filename>. But the problem is that the
> scan is costly, so I wonder if this is the reason for the CPU
> spikes.
>
> Pranith
>
>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri
>> <pkarampu at redhat.com> wrote:
>>
>> Do you have any windows clients? I see a lot of getxattr calls
>> for "glusterfs.get_real_filename" which lead to full readdirs
>> of the directories on the brick.
>>
>> Pranith
>>
>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
>>> Pranith, could this kind of behavior be self-inflicted by us
>>> deleting files directly from the bricks? We have done that in
>>> the past to clean up an issue where gluster wouldn't allow us
>>> to delete from the mount.
>>>
>>> If so, is it feasible to clean them up by running a search on
>>> the .glusterfs directories directly and removing files with a
>>> reference count of 1 that are non-zero size (or directly
>>> checking the xattrs to be sure that it's not a DHT link)?
>>>
>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -exec rm -f "{}" \;
>>>
>>> Is there anything I'm inherently missing with that approach
>>> that will further corrupt the system?
>>>
>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
>>> <patrick.glomski at corvidtec.com> wrote:
>>>
>>> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl
>>> has been running on one of the bricks on gfs02b for 25 min or
>>> so and users cannot access the volume.
>>>
>>> I re-listed the xattrop directories as well as a 'top' entry
>>> and heal statistics. Then I restarted the gluster services on
>>> gfs02a.
>>>
>>> =================== top ===================
>>>  PID  USER  PR  NI  VIRT   RES   SHR  S  %CPU   %MEM  TIME+      COMMAND
>>> 8969  root  20   0  2815m  204m  3588 S  1181.0  0.6  591:06.93  glusterfsd
>>>
>>> =================== xattrop ===================
>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
>>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>>
>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>>
>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>>
>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>>
>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>>
>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>>
>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
>>> c9ce22ed-6d8b-471b-a111-b39e57f0b512
>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
>>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>>
>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>>
>>> =================== heal stats ===================
>>>
>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:45 2016
>>> homegfs [b0-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:36:45 2016
>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>> homegfs [b0-gfsib01a] : No. of entries healed : 0
>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>>
>>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:19 2016
>>> homegfs [b1-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:36:19 2016
>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>> homegfs [b1-gfsib01b] : No. of entries healed : 0
>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>>
>>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:48 2016
>>> homegfs [b2-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:36:48 2016
>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>> homegfs [b2-gfsib01a] : No. of entries healed : 0
>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>>
>>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:47 2016
>>> homegfs [b3-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:36:47 2016
>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>> homegfs [b3-gfsib01b] : No. of entries healed : 0
>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>>
>>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:06 2016
>>> homegfs [b4-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:36:06 2016
>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>> homegfs [b4-gfsib02a] : No. of entries healed : 0
>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>> homegfs [b4-gfsib02a] : No. of heal failed entries : 0
>>>
>>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:13:40 2016
>>> homegfs [b5-gfsib02b] : *** Crawl is in progress ***
>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>> homegfs [b5-gfsib02b] : No. of entries healed : 0
>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>> homegfs [b5-gfsib02b] : No. of heal failed entries : 0
>>>
>>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:58 2016
>>> homegfs [b6-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:36:58 2016
>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>> homegfs [b6-gfsib02a] : No. of entries healed : 0
>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>>
>>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:36:50 2016
>>> homegfs [b7-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:36:50 2016
>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>> homegfs [b7-gfsib02b] : No. of entries healed : 0
>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>>
>>> =======================================================================================
>>> I waited a few minutes for the heals to finish and ran the heal
>>> statistics and info again. One file is in split-brain. Aside
>>> from the split-brain, the load on all systems is down now and
>>> they are behaving normally. glustershd.log is attached. What is
>>> going on???
>>>
>>> Thu Jan 21 12:53:50 EST 2016
>>>
>>> =================== homegfs ===================
>>>
>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:02 2016
>>> homegfs [b0-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:53:02 2016
>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>> homegfs [b0-gfsib01a] : No. of entries healed : 0
>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>>
>>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:38 2016
>>> homegfs [b1-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:53:38 2016
>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>> homegfs [b1-gfsib01b] : No. of entries healed : 0
>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>>
>>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>> homegfs [b2-gfsib01a] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>> homegfs [b2-gfsib01a] : No. of entries healed : 0
>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>>
>>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>> homegfs [b3-gfsib01b] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>> homegfs [b3-gfsib01b] : No. of entries healed : 0
>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>>
>>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:33 2016
>>> homegfs [b4-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:53:33 2016
>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>> homegfs [b4-gfsib02a] : No. of entries healed : 0
>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>> homegfs [b4-gfsib02a] : No. of heal failed entries : 1
>>>
>>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:14 2016
>>> homegfs [b5-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:53:15 2016
>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>> homegfs [b5-gfsib02b] : No. of entries healed : 0
>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>> homegfs [b5-gfsib02b] : No. of heal failed entries : 3
>>>
>>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>>> homegfs [b6-gfsib02a] : Ending time of crawl   : Thu Jan 21 12:53:04 2016
>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>> homegfs [b6-gfsib02a] : No. of entries healed : 0
>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>>
>>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:09 2016
>>> homegfs [b7-gfsib02b] : Ending time of crawl   : Thu Jan 21 12:53:09 2016
>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>> homegfs [b7-gfsib02b] : No. of entries healed : 0
>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>>
>>> *** gluster bug in 'gluster volume heal homegfs statistics' ***
>>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
>>>
>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>> /users/bangell/.gconfd - Is in split-brain
>>> Number of entries: 1
>>>
>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>> /users/bangell/.gconfd - Is in split-brain
>>> /users/bangell/.gconfd/saved_state
>>> Number of entries: 2
>>>
>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>> Number of entries: 0
>>>
>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri
>>> <pkarampu at redhat.com> wrote:
>>>
>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
>>>> I should mention that the problem is not currently occurring
>>>> and there are no heals (output appended). By restarting the
>>>> gluster services, we can stop the crawl, which lowers the
>>>> load for a while. Subsequent crawls seem to finish properly.
>>>> For what it's worth, files/folders that show up in the
>>>> 'volume info' output during a hung crawl don't seem to be
>>>> anything out of the ordinary.
>>>>
>>>> Over the past four days, the typical time before the problem
>>>> recurs after suppressing it in this manner is an hour. Last
>>>> night when we reached out to you was the last time it
>>>> happened, and the load has been low since (a relief). David
>>>> believes that recursively listing the files (ls -alR or
>>>> similar) from a client mount can force the issue to happen,
>>>> but obviously I'd rather not unless we have some precise
>>>> thing we're looking for. Let me know if you'd like me to
>>>> attempt to drive the system unstable like that and what I
>>>> should look for. As it's a production system, I'd rather not
>>>> leave it in this state for long.
>>>
>>> Will it be possible to send the glustershd and mount logs of
>>> the past 4 days? I would like to see if this is because of
>>> directory self-heal going wild (Ravi is working on a
>>> throttling feature for 3.8, which will allow putting brakes on
>>> self-heal traffic).
>>>
>>> Pranith
>>>
>>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri
>>>> <pkarampu at redhat.com> wrote:
>>>>
>>>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
>>>>> Hello, Pranith. The typical behavior is that the %cpu on a
>>>>> glusterfsd process jumps to the number of processor cores
>>>>> available (800% or 1200%, depending on the pair of nodes
>>>>> involved) and the load average on the machine goes very high
>>>>> (~20). The volume's heal statistics output shows that it is
>>>>> crawling one of the bricks and trying to heal, but this
>>>>> crawl hangs and never seems to finish.
>>>>>
>>>>> The number of files in the xattrop directory varies over
>>>>> time, so I ran a wc -l as you requested periodically for
>>>>> some time and then started including a datestamped list of
>>>>> the files that were in the xattrop directory on each brick
>>>>> to see which were persistent. All bricks had files in the
>>>>> xattrop folder, so all results are attached.
>>>>
>>>> Thanks, this info is helpful. I don't see a lot of files.
>>>> Could you give the output of "gluster volume heal <volname>
>>>> info"? Is there any directory in there which is LARGE?
>>>>
>>>> Pranith
>>>>
>>>>> Please let me know if there is anything else I can provide.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri
>>>>> <pkarampu at redhat.com> wrote:
>>>>>
>>>>> hey,
>>>>>      Which process is consuming so much cpu? I went through
>>>>> the logs you gave me. I see that the following files are in
>>>>> gfid mismatch state:
>>>>>
>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>>>>
>>>>> Could you give me the output of "ls
>>>>> <brick-path>/indices/xattrop | wc -l" on all the bricks
>>>>> which are acting this way? This will tell us the number of
>>>>> pending self-heals on the system.
>>>>>
>>>>> Pranith
>>>>>
>>>>> On 01/20/2016 09:26 PM, David Robinson wrote:
>>>>>> resending with parsed logs...
>>>>>>
>>>>>>>> I am having issues with 3.6.6 where the load will spike
>>>>>>>> up to 800% for one of the glusterfsd processes and the
>>>>>>>> users can no longer access the system. If I reboot the
>>>>>>>> node, the heal will finish normally after a few minutes
>>>>>>>> and the system will be responsive, but a few hours later
>>>>>>>> the issue will start again. It looks like it is hanging
>>>>>>>> in a heal and spinning up the load on one of the bricks.
>>>>>>>> The heal gets stuck and says it is crawling and never
>>>>>>>> returns. After a few minutes of the heal saying it is
>>>>>>>> crawling, the load spikes up and the mounts become
>>>>>>>> unresponsive.
>>>>>>>>
>>>>>>>> Any suggestions on how to fix this? It has us stopped
>>>>>>>> cold, as the users can no longer access the systems when
>>>>>>>> the load spikes... Logs attached.
>>>>>>>>
>>>>>>>> System setup info is:
>>>>>>>>
>>>>>>>> [root at gfs01a ~]# gluster volume info homegfs
>>>>>>>>
>>>>>>>> Volume Name: homegfs
>>>>>>>> Type: Distributed-Replicate
>>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 4 x 2 = 8
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>> Options Reconfigured:
>>>>>>>> performance.io-thread-count: 32
>>>>>>>> performance.cache-size: 128MB
>>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>>> server.allow-insecure: on
>>>>>>>> network.ping-timeout: 42
>>>>>>>> storage.owner-gid: 100
>>>>>>>> geo-replication.indexing: off
>>>>>>>> geo-replication.ignore-pid-check: on
>>>>>>>> changelog.changelog: off
>>>>>>>> changelog.fsync-interval: 3
>>>>>>>> changelog.rollover-time: 15
>>>>>>>> server.manage-gids: on
>>>>>>>> diagnostics.client-log-level: WARNING
>>>>>>>>
>>>>>>>> [root at gfs01a ~]# rpm -qa | grep gluster
>>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel at gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160122/42006861/attachment.html>
Last entry for get_real_filename on any of the bricks was when we turned off the samba gfapi vfs plugin earlier today:

/var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21 15:13:00.008239] E [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 105: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)

We'll get back to you with those traces when %cpu spikes again. As with most sporadic problems, as soon as you want something out of it, the issue becomes harder to reproduce.
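
If it helps, we can leave a small watcher running on each server so the traces get captured even if nobody is logged in when it happens. A rough sketch only; the load threshold of 15 is just a guess based on the ~20 load averages we've been seeing:

#!/bin/bash
# Dump stacks of every glusterfsd whenever the 1-minute load average
# crosses the threshold (sketch only; threshold and paths are guesses).
while sleep 30; do
    load=$(awk '{print int($1)}' /proc/loadavg)
    if [ "$load" -ge 15 ]; then
        for pid in $(pidof glusterfsd); do
            gstack "$pid" > /tmp/gstack.$pid.$(date +%Y%m%d-%H%M%S)
        done
    fi
done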

On Thu, Jan 21, 2016 at 9:21 PM, Pranith Kumar Karampuri
<pkarampu at redhat.com> wrote:

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160121/8fa6246f/attachment.html>