We use the samba glusterfs virtual filesystem (the current version provided on download.gluster.org), but we have no Windows clients connecting directly.

On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:

> Do you have any windows clients? I see a lot of getxattr calls for
> "glusterfs.get_real_filename" which lead to full readdirs of the
> directories on the brick.
>
> Pranith
>
> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
>
> Pranith, could this kind of behavior be self-inflicted by us deleting
> files directly from the bricks? We have done that in the past to clean up
> an issue where gluster wouldn't allow us to delete from the mount.
>
> If so, is it feasible to clean them up by running a search on the
> .glusterfs directories directly and removing files with a reference count
> of 1 that are non-zero size (or directly checking the xattrs to be sure
> that it's not a DHT link)?
>
> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -exec rm -f "{}" \;
>
> Is there anything I'm inherently missing with that approach that will
> further corrupt the system?
>
>
> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick <patrick.glomski at corvidtec.com> wrote:
>
>> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl has been
>> running on one of the bricks on gfs02b for 25 min or so and users cannot
>> access the volume.
>>
>> I re-listed the xattrop directories as well as a 'top' entry and heal
>> statistics. Then I restarted the gluster services on gfs02a.
>>
>> =================== top ===================
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 8969 root 20 0 2815m 204m 3588 S 1181.0 0.6 591:06.93 glusterfsd
>>
>> =================== xattrop ===================
>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>
>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>
>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>
>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>
>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>
>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>
>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8 c9ce22ed-6d8b-471b-a111-b39e57f0b512
>> 94fa1d60-45ad-4341-b69c-315936b51e8d
>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>
>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>
>>
>> =================== heal stats ===================
>>
>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:45 2016
>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:36:45 2016
>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>> homegfs [b0-gfsib01a] : No. of entries healed : 0
>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>
>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:19 2016
>> homegfs [b1-gfsib01b] : Ending time of crawl : Thu Jan 21 12:36:19 2016
>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>> homegfs [b1-gfsib01b] : No. of entries healed : 0
>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>
>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:48 2016
>> homegfs [b2-gfsib01a] : Ending time of crawl : Thu Jan 21 12:36:48 2016
>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>> homegfs [b2-gfsib01a] : No. of entries healed : 0
>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>
>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:47 2016
>> homegfs [b3-gfsib01b] : Ending time of crawl : Thu Jan 21 12:36:47 2016
>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>> homegfs [b3-gfsib01b] : No. of entries healed : 0
>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>
>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:06 2016
>> homegfs [b4-gfsib02a] : Ending time of crawl : Thu Jan 21 12:36:06 2016
>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>> homegfs [b4-gfsib02a] : No. of entries healed : 0
>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>> homegfs [b4-gfsib02a] : No. of heal failed entries : 0
>>
>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:13:40 2016
>> homegfs [b5-gfsib02b] : *** Crawl is in progress ***
>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>> homegfs [b5-gfsib02b] : No. of entries healed : 0
>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>> homegfs [b5-gfsib02b] : No. of heal failed entries : 0
>>
>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:58 2016
>> homegfs [b6-gfsib02a] : Ending time of crawl : Thu Jan 21 12:36:58 2016
>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>> homegfs [b6-gfsib02a] : No. of entries healed : 0
>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>
>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:36:50 2016
>> homegfs [b7-gfsib02b] : Ending time of crawl : Thu Jan 21 12:36:50 2016
>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>> homegfs [b7-gfsib02b] : No. of entries healed : 0
>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>
>>
>> =======================================================================================
>> I waited a few minutes for the heals to finish and ran the heal
>> statistics and info again. One file is in split-brain. Aside from the
>> split-brain, the load on all systems is down now and they are behaving
>> normally. glustershd.log is attached. What is going on???
>>
>> Thu Jan 21 12:53:50 EST 2016
>>
>> =================== homegfs ===================
>>
>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:02 2016
>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:02 2016
>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>> homegfs [b0-gfsib01a] : No. of entries healed : 0
>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
>>
>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:38 2016
>> homegfs [b1-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:38 2016
>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>> homegfs [b1-gfsib01b] : No. of entries healed : 0
>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
>>
>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b2-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>> homegfs [b2-gfsib01a] : No. of entries healed : 0
>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
>>
>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b3-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>> homegfs [b3-gfsib01b] : No. of entries healed : 0
>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
>>
>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:33 2016
>> homegfs [b4-gfsib02a] : Ending time of crawl : Thu Jan 21 12:53:33 2016
>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>> homegfs [b4-gfsib02a] : No. of entries healed : 0
>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>> homegfs [b4-gfsib02a] : No. of heal failed entries : 1
>>
>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:14 2016
>> homegfs [b5-gfsib02b] : Ending time of crawl : Thu Jan 21 12:53:15 2016
>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>> homegfs [b5-gfsib02b] : No. of entries healed : 0
>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>> homegfs [b5-gfsib02b] : No. of heal failed entries : 3
>>
>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b6-gfsib02a] : Ending time of crawl : Thu Jan 21 12:53:04 2016
>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>> homegfs [b6-gfsib02a] : No. of entries healed : 0
>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
>>
>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:09 2016
>> homegfs [b7-gfsib02b] : Ending time of crawl : Thu Jan 21 12:53:09 2016
>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>> homegfs [b7-gfsib02b] : No. of entries healed : 0
>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
>>
>> *** gluster bug in 'gluster volume heal homegfs statistics' ***
>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
>>
>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>> Number of entries: 0
>>
>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>> Number of entries: 0
>>
>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>> Number of entries: 0
>>
>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>> Number of entries: 0
>>
>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>> /users/bangell/.gconfd - Is in split-brain
>>
>> Number of entries: 1
>>
>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>> /users/bangell/.gconfd - Is in split-brain
>>
>> /users/bangell/.gconfd/saved_state
>> Number of entries: 2
>>
>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>> Number of entries: 0
>>
>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>> Number of entries: 0
>>
>>
>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>
>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
>>>
>>> I should mention that the problem is not currently occurring and there
>>> are no heals (output appended). By restarting the gluster services, we
>>> can stop the crawl, which lowers the load for a while. Subsequent crawls
>>> seem to finish properly. For what it's worth, files/folders that show up
>>> in the 'volume info' output during a hung crawl don't seem to be
>>> anything out of the ordinary.
>>>
>>> Over the past four days, the typical time before the problem recurs
>>> after suppressing it in this manner is an hour. Last night when we
>>> reached out to you was the last time it happened and the load has been
>>> low since (a relief). David believes that recursively listing the files
>>> (ls -alR or similar) from a client mount can force the issue to happen,
>>> but obviously I'd rather not unless we have some precise thing we're
>>> looking for. Let me know if you'd like me to attempt to drive the system
>>> unstable like that and what I should look for. As it's a production
>>> system, I'd rather not leave it in this state for long.
>>>
>>>
>>> Will it be possible to send glustershd, mount logs of the past 4 days? I
>>> would like to see if this is because of directory self-heal going wild
>>> (Ravi is working on a throttling feature for 3.8, which will allow us to
>>> put brakes on self-heal traffic)
>>>
>>> Pranith
>>>
>>>
>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>> Number of entries: 0
>>>
>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>> Number of entries: 0
>>>
>>>
>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>
>>>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
>>>>
>>>> Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd
>>>> process jumps to the number of processor cores available (800% or
>>>> 1200%, depending on the pair of nodes involved) and the load average on
>>>> the machine goes very high (~20). The volume's heal statistics output
>>>> shows that it is crawling one of the bricks and trying to heal, but
>>>> this crawl hangs and never seems to finish.
>>>>
>>>> The number of files in the xattrop directory varies over time, so I ran
>>>> a wc -l as you requested periodically for some time and then started
>>>> including a datestamped list of the files that were in the xattrop
>>>> directory on each brick to see which were persistent. All bricks had
>>>> files in the xattrop folder, so all results are attached.
>>>>
>>>> Thanks, this info is helpful. I don't see a lot of files. Could you
>>>> give the output of "gluster volume heal <volname> info"? Is there any
>>>> directory in there which is LARGE?
>>>>
>>>> Pranith
>>>>
>>>>
>>>> Please let me know if there is anything else I can provide.
>>>>
>>>> Patrick
>>>>
>>>>
>>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>
>>>>> hey,
>>>>> Which process is consuming so much cpu? I went through the logs
>>>>> you gave me. I see that the following files are in gfid mismatch state:
>>>>>
>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>>>>
>>>>> Could you give me the output of "ls <brick-path>/indices/xattrop | wc
>>>>> -l" on all the bricks which are acting this way? This will tell us the
>>>>> number of pending self-heals on the system.
>>>>>
>>>>> Pranith
>>>>>
>>>>>
>>>>> On 01/20/2016 09:26 PM, David Robinson wrote:
>>>>>
>>>>> resending with parsed logs...
>>>>>
>>>>> I am having issues with 3.6.6 where the load will spike up to 800% for
>>>>> one of the glusterfsd processes and the users can no longer access the
>>>>> system. If I reboot the node, the heal will finish normally after a few
>>>>> minutes and the system will be responsive, but a few hours later the
>>>>> issue will start again. It looks like it is hanging in a heal and
>>>>> spinning up the load on one of the bricks. The heal gets stuck and says
>>>>> it is crawling and never returns. After a few minutes of the heal
>>>>> saying it is crawling, the load spikes up and the mounts become
>>>>> unresponsive.
>>>>>
>>>>> Any suggestions on how to fix this? It has us stopped cold as the
>>>>> user can no longer access the systems when the load spikes... Logs
>>>>> attached.
>>>>>
>>>>> System setup info is:
>>>>>
>>>>> [root at gfs01a ~]# gluster volume info homegfs
>>>>>
>>>>> Volume Name: homegfs
>>>>> Type: Distributed-Replicate
>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>> Status: Started
>>>>> Number of Bricks: 4 x 2 = 8
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>> Options Reconfigured:
>>>>> performance.io-thread-count: 32
>>>>> performance.cache-size: 128MB
>>>>> performance.write-behind-window-size: 128MB
>>>>> server.allow-insecure: on
>>>>> network.ping-timeout: 42
>>>>> storage.owner-gid: 100
>>>>> geo-replication.indexing: off
>>>>> geo-replication.ignore-pid-check: on
>>>>> changelog.changelog: off
>>>>> changelog.fsync-interval: 3
>>>>> changelog.rollover-time: 15
>>>>> server.manage-gids: on
>>>>> diagnostics.client-log-level: WARNING
>>>>>
>>>>> [root at gfs01a ~]# rpm -qa | grep gluster
>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>> glusterfs-api-3.6.6-1.el6.x86_64
>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
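Before running a destructive cleanup such as the find ... -exec rm command quoted in the message above, it may be safer to list the candidates first and review them. The following is only a minimal sketch, not a vetted GlusterFS procedure. The function name list_orphan_candidates is hypothetical, GNU find is assumed, and the selection rule (non-empty regular file with a hard-link count of 1) is taken from the question in the thread. Whether such files are actually safe to delete is exactly what the thread leaves open, so the sketch prints paths and deletes nothing.

```shell
# Hypothetical helper (name not from the thread): print, but never delete,
# non-empty regular files with a hard-link count of 1 under the directory
# given as $1, for example a brick's .glusterfs directory.
list_orphan_candidates() {
    # "-links 1" selects the same files as the thread's "-links -2"
    # (fewer than two links); "-print" replaces the "-exec rm -f".
    find "$1" -type f -not -empty -links 1 -print
}

# Example invocation, using the brick path from the thread:
#   list_orphan_candidates /data/brick01a/homegfs/.glusterfs > /tmp/candidates.txt
```

Reviewing the saved list (and checking the xattrs of each entry, as suggested in the question) before deleting anything keeps the step reversible.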
Pranith Kumar Karampuri
2016-Jan-22 01:49 UTC
[Gluster-users] [Gluster-devel] heal hanging
On 01/22/2016 07:13 AM, Glomski, Patrick wrote:> We use the samba glusterfs virtual filesystem (the current version > provided on download.gluster.org <http://download.gluster.org>), but > no windows clients connecting directly.Hmm.. Is there a way to disable using this and check if the CPU% still increases? What getxattr of "glusterfs.get_real_filename <filanme>" does is to scan the entire directory looking for strcasecmp(<filname>, <scanned-filename>). If anything matches then it will return the <scanned-filename>. But the problem is the scan is costly. So I wonder if this is the reason for the CPU spikes. Pranith> > On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri > <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote: > > Do you have any windows clients? I see a lot of getxattr calls for > "glusterfs.get_real_filename" which lead to full readdirs of the > directories on the brick. > > Pranith > > On 01/22/2016 12:51 AM, Glomski, Patrick wrote: >> Pranith, could this kind of behavior be self-inflicted by us >> deleting files directly from the bricks? We have done that in the >> past to clean up an issues where gluster wouldn't allow us to >> delete from the mount. >> >> If so, is it feasible to clean them up by running a search on the >> .glusterfs directories directly and removing files with a >> reference count of 1 that are non-zero size (or directly checking >> the xattrs to be sure that it's not a DHT link). >> >> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links >> -2 -exec rm -f "{}" \; >> >> Is there anything I'm inherently missing with that approach that >> will further corrupt the system? >> >> >> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick >> <patrick.glomski at corvidtec.com >> <mailto:patrick.glomski at corvidtec.com>> wrote: >> >> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl >> has been running on one of the bricks on gfs02b for 25 min or >> so and users cannot access the volume. 
>> >> I re-listed the xattrop directories as well as a 'top' entry >> and heal statistics. Then I restarted the gluster services on >> gfs02a. >> >> =================== top ==================>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ >> COMMAND >> 8969 root 20 0 2815m 204m 3588 S 1181.0 0.6 >> 591:06.93 glusterfsd >> >> =================== xattrop ==================>> /data/brick01a/homegfs/.glusterfs/indices/xattrop: >> xattrop-41f19453-91e4-437c-afa9-3b25614de210 >> xattrop-9b815879-2f4d-402b-867c-a6d65087788c >> >> /data/brick02a/homegfs/.glusterfs/indices/xattrop: >> xattrop-70131855-3cfb-49af-abce-9d23f57fb393 >> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6 >> >> /data/brick01b/homegfs/.glusterfs/indices/xattrop: >> e6e47ed9-309b-42a7-8c44-28c29b9a20f8 >> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125 >> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934 >> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0 >> >> /data/brick02b/homegfs/.glusterfs/indices/xattrop: >> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc >> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413 >> >> /data/brick01a/homegfs/.glusterfs/indices/xattrop: >> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531 >> >> /data/brick02a/homegfs/.glusterfs/indices/xattrop: >> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70 >> >> /data/brick01b/homegfs/.glusterfs/indices/xattrop: >> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8 >> c9ce22ed-6d8b-471b-a111-b39e57f0b512 >> 94fa1d60-45ad-4341-b69c-315936b51e8d >> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7 >> >> /data/brick02b/homegfs/.glusterfs/indices/xattrop: >> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d >> >> >> =================== heal stats ==================>> >> homegfs [b0-gfsib01a] : Starting time of crawl : Thu >> Jan 21 12:36:45 2016 >> homegfs [b0-gfsib01a] : Ending time of crawl : Thu >> Jan 21 12:36:45 2016 >> homegfs [b0-gfsib01a] : Type of crawl: INDEX >> homegfs [b0-gfsib01a] : No. of entries healed : 0 >> homegfs [b0-gfsib01a] : No. 
of entries in split-brain: 0 >> homegfs [b0-gfsib01a] : No. of heal failed entries : 0 >> >> homegfs [b1-gfsib01b] : Starting time of crawl : Thu >> Jan 21 12:36:19 2016 >> homegfs [b1-gfsib01b] : Ending time of crawl : Thu >> Jan 21 12:36:19 2016 >> homegfs [b1-gfsib01b] : Type of crawl: INDEX >> homegfs [b1-gfsib01b] : No. of entries healed : 0 >> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0 >> homegfs [b1-gfsib01b] : No. of heal failed entries : 1 >> >> homegfs [b2-gfsib01a] : Starting time of crawl : Thu >> Jan 21 12:36:48 2016 >> homegfs [b2-gfsib01a] : Ending time of crawl : Thu >> Jan 21 12:36:48 2016 >> homegfs [b2-gfsib01a] : Type of crawl: INDEX >> homegfs [b2-gfsib01a] : No. of entries healed : 0 >> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0 >> homegfs [b2-gfsib01a] : No. of heal failed entries : 0 >> >> homegfs [b3-gfsib01b] : Starting time of crawl : Thu >> Jan 21 12:36:47 2016 >> homegfs [b3-gfsib01b] : Ending time of crawl : Thu >> Jan 21 12:36:47 2016 >> homegfs [b3-gfsib01b] : Type of crawl: INDEX >> homegfs [b3-gfsib01b] : No. of entries healed : 0 >> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0 >> homegfs [b3-gfsib01b] : No. of heal failed entries : 0 >> >> homegfs [b4-gfsib02a] : Starting time of crawl : Thu >> Jan 21 12:36:06 2016 >> homegfs [b4-gfsib02a] : Ending time of crawl : Thu >> Jan 21 12:36:06 2016 >> homegfs [b4-gfsib02a] : Type of crawl: INDEX >> homegfs [b4-gfsib02a] : No. of entries healed : 0 >> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0 >> homegfs [b4-gfsib02a] : No. of heal failed entries : 0 >> >> homegfs [b5-gfsib02b] : Starting time of crawl : Thu >> Jan 21 12:13:40 2016 >> homegfs [b5-gfsib02b] : *** >> Crawl is in progress *** >> homegfs [b5-gfsib02b] : Type of crawl: INDEX >> homegfs [b5-gfsib02b] : No. of entries healed : 0 >> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0 >> homegfs [b5-gfsib02b] : No. 
of heal failed entries : 0 >> >> homegfs [b6-gfsib02a] : Starting time of crawl : Thu >> Jan 21 12:36:58 2016 >> homegfs [b6-gfsib02a] : Ending time of crawl : Thu >> Jan 21 12:36:58 2016 >> homegfs [b6-gfsib02a] : Type of crawl: INDEX >> homegfs [b6-gfsib02a] : No. of entries healed : 0 >> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0 >> homegfs [b6-gfsib02a] : No. of heal failed entries : 0 >> >> homegfs [b7-gfsib02b] : Starting time of crawl : Thu >> Jan 21 12:36:50 2016 >> homegfs [b7-gfsib02b] : Ending time of crawl : Thu >> Jan 21 12:36:50 2016 >> homegfs [b7-gfsib02b] : Type of crawl: INDEX >> homegfs [b7-gfsib02b] : No. of entries healed : 0 >> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0 >> homegfs [b7-gfsib02b] : No. of heal failed entries : 0 >> >> >> =======================================================================================>> I waited a few minutes for the heals to finish and ran the >> heal statistics and info again. one file is in split-brain. >> Aside from the split-brain, the load on all systems is down >> now and they are behaving normally. glustershd.log is >> attached. What is going on??? >> >> Thu Jan 21 12:53:50 EST 2016 >> >> =================== homegfs ==================>> >> homegfs [b0-gfsib01a] : Starting time of crawl : Thu >> Jan 21 12:53:02 2016 >> homegfs [b0-gfsib01a] : Ending time of crawl : Thu >> Jan 21 12:53:02 2016 >> homegfs [b0-gfsib01a] : Type of crawl: INDEX >> homegfs [b0-gfsib01a] : No. of entries healed : 0 >> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0 >> homegfs [b0-gfsib01a] : No. of heal failed entries : 0 >> >> homegfs [b1-gfsib01b] : Starting time of crawl : Thu >> Jan 21 12:53:38 2016 >> homegfs [b1-gfsib01b] : Ending time of crawl : Thu >> Jan 21 12:53:38 2016 >> homegfs [b1-gfsib01b] : Type of crawl: INDEX >> homegfs [b1-gfsib01b] : No. of entries healed : 0 >> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0 >> homegfs [b1-gfsib01b] : No. 
of heal failed entries : 1 >> >> homegfs [b2-gfsib01a] : Starting time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b2-gfsib01a] : Ending time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b2-gfsib01a] : Type of crawl: INDEX >> homegfs [b2-gfsib01a] : No. of entries healed : 0 >> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0 >> homegfs [b2-gfsib01a] : No. of heal failed entries : 0 >> >> homegfs [b3-gfsib01b] : Starting time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b3-gfsib01b] : Ending time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b3-gfsib01b] : Type of crawl: INDEX >> homegfs [b3-gfsib01b] : No. of entries healed : 0 >> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0 >> homegfs [b3-gfsib01b] : No. of heal failed entries : 0 >> >> homegfs [b4-gfsib02a] : Starting time of crawl : Thu >> Jan 21 12:53:33 2016 >> homegfs [b4-gfsib02a] : Ending time of crawl : Thu >> Jan 21 12:53:33 2016 >> homegfs [b4-gfsib02a] : Type of crawl: INDEX >> homegfs [b4-gfsib02a] : No. of entries healed : 0 >> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0 >> homegfs [b4-gfsib02a] : No. of heal failed entries : 1 >> >> homegfs [b5-gfsib02b] : Starting time of crawl : Thu >> Jan 21 12:53:14 2016 >> homegfs [b5-gfsib02b] : Ending time of crawl : Thu >> Jan 21 12:53:15 2016 >> homegfs [b5-gfsib02b] : Type of crawl: INDEX >> homegfs [b5-gfsib02b] : No. of entries healed : 0 >> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0 >> homegfs [b5-gfsib02b] : No. of heal failed entries : 3 >> >> homegfs [b6-gfsib02a] : Starting time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b6-gfsib02a] : Ending time of crawl : Thu >> Jan 21 12:53:04 2016 >> homegfs [b6-gfsib02a] : Type of crawl: INDEX >> homegfs [b6-gfsib02a] : No. of entries healed : 0 >> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0 >> homegfs [b6-gfsib02a] : No. 
of heal failed entries : 0 >> >> homegfs [b7-gfsib02b] : Starting time of crawl : Thu >> Jan 21 12:53:09 2016 >> homegfs [b7-gfsib02b] : Ending time of crawl : Thu >> Jan 21 12:53:09 2016 >> homegfs [b7-gfsib02b] : Type of crawl: INDEX >> homegfs [b7-gfsib02b] : No. of entries healed : 0 >> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0 >> homegfs [b7-gfsib02b] : No. of heal failed entries : 0 >> >> *** gluster bug in 'gluster volume heal homegfs statistics' *** >> *** Use 'gluster volume heal homegfs info' until bug is fixed *** >> >> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/ >> Number of entries: 0 >> >> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/ >> Number of entries: 0 >> >> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/ >> Number of entries: 0 >> >> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/ >> Number of entries: 0 >> >> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/ >> /users/bangell/.gconfd - Is in split-brain >> >> Number of entries: 1 >> >> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/ >> /users/bangell/.gconfd - Is in split-brain >> >> /users/bangell/.gconfd/saved_state >> Number of entries: 2 >> >> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/ >> Number of entries: 0 >> >> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/ >> Number of entries: 0 >> >> >> >> >> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri >> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote: >> >> >> >> On 01/21/2016 09:26 PM, Glomski, Patrick wrote: >>> I should mention that the problem is not currently >>> occurring and there are no heals (output appended). By >>> restarting the gluster services, we can stop the crawl, >>> which lowers the load for a while. Subsequent crawls >>> seem to finish properly. For what it's worth, >>> files/folders that show up in the 'volume info' output >>> during a hung crawl don't seem to be anything out of the >>> ordinary. 
>>> >>> Over the past four days, the typical time before the >>> problem recurs after suppressing it in this manner is an >>> hour. Last night when we reached out to you was the last >>> time it happened and the load has been low since (a >>> relief). David believes that recursively listing the >>> files (ls -alR or similar) from a client mount can force >>> the issue to happen, but obviously I'd rather not unless >>> we have some precise thing we're looking for. Let me >>> know if you'd like me to attempt to drive the system >>> unstable like that and what I should look for. As it's a >>> production system, I'd rather not leave it in this state >>> for long. >> >> Will it be possible to send glustershd, mount logs of the >> past 4 days? I would like to see if this is because of >> directory self-heal going wild (Ravi is working on >> throttling feature for 3.8, which will allow to put >> breaks on self-heal traffic) >> >> Pranith >> >>> >>> [root at gfs01a xattrop]# gluster volume heal homegfs info >>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/ >>> Number of entries: 0 >>> >>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/ >>> Number of entries: 0 >>> >>> >>> >>> >>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar >>> Karampuri <pkarampu at redhat.com >>> <mailto:pkarampu at redhat.com>> wrote: >>> >>> >>> >>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote: >>>> Hello, Pranith. 
The typical behavior is that the >>>> %cpu on a glusterfsd process jumps to number of >>>> processor cores available (800% or 1200%, depending >>>> on the pair of nodes involved) and the load average >>>> on the machine goes very high (~20). The volume's >>>> heal statistics output shows that it is crawling >>>> one of the bricks and trying to heal, but this >>>> crawl hangs and never seems to finish. >>>> >>>> The number of files in the xattrop directory varies >>>> over time, so I ran a wc -l as you requested >>>> periodically for some time and then started >>>> including a datestamped list of the files that were >>>> in the xattrops directory on each brick to see >>>> which were persistent. All bricks had files in the >>>> xattrop folder, so all results are attached. >>> Thanks this info is helpful. I don't see a lot of >>> files. Could you give output of "gluster volume heal >>> <volname> info"? Is there any directory in there >>> which is LARGE? >>> >>> Pranith >>> >>>> >>>> Please let me know if there is anything else I can >>>> provide. >>>> >>>> Patrick >>>> >>>> >>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar >>>> Karampuri <pkarampu at redhat.com >>>> <mailto:pkarampu at redhat.com>> wrote: >>>> >>>> hey, >>>> Which process is consuming so much cpu? >>>> I went through the logs you gave me. I see that >>>> the following files are in gfid mismatch state: >>>> >>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>, >>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>, >>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>, >>>> >>>> Could you give me the output of "ls >>>> <brick-path>/indices/xattrop | wc -l" output on >>>> all the bricks which are acting this way? This >>>> will tell us the number of pending self-heals >>>> on the system. >>>> >>>> Pranith >>>> >>>> >>>> On 01/20/2016 09:26 PM, David Robinson wrote: >>>>> resending with parsed logs... 
>>>>>>> I am having issues with 3.6.6 where the load
>>>>>>> will spike up to 800% for one of the
>>>>>>> glusterfsd processes and the users can no
>>>>>>> longer access the system. If I reboot the
>>>>>>> node, the heal will finish normally after a
>>>>>>> few minutes and the system will be
>>>>>>> responsive, but a few hours later the issue
>>>>>>> will start again. It looks like it is hanging
>>>>>>> in a heal and spinning up the load on one of
>>>>>>> the bricks. The heal gets stuck and says it
>>>>>>> is crawling and never returns. After a few
>>>>>>> minutes of the heal saying it is crawling,
>>>>>>> the load spikes up and the mounts become
>>>>>>> unresponsive.
>>>>>>> Any suggestions on how to fix this? It has
>>>>>>> us stopped cold as the users can no longer
>>>>>>> access the systems when the load spikes...
>>>>>>> Logs attached.
>>>>>>> System setup info is:
>>>>>>> [root@gfs01a ~]# gluster volume info homegfs
>>>>>>>
>>>>>>> Volume Name: homegfs
>>>>>>> Type: Distributed-Replicate
>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 4 x 2 = 8
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>> Options Reconfigured:
>>>>>>> performance.io-thread-count: 32
>>>>>>> performance.cache-size: 128MB
>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>> server.allow-insecure: on
>>>>>>> network.ping-timeout: 42
>>>>>>> storage.owner-gid: 100
>>>>>>> geo-replication.indexing: off
>>>>>>> geo-replication.ignore-pid-check: on
>>>>>>> changelog.changelog: off
>>>>>>> changelog.fsync-interval: 3
>>>>>>> changelog.rollover-time: 15
>>>>>>> server.manage-gids: on
>>>>>>> diagnostics.client-log-level: WARNING
>>>>>>> [root@gfs01a ~]# rpm -qa | grep gluster
>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
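
For anyone following along: the periodic check described earlier in the thread (count the entries in each brick's xattrop index directory and keep a datestamped listing) could be scripted roughly as below. This is only a sketch under assumptions. The function name snapshot_xattrop is made up for illustration, and the index location <brick>/.glusterfs/indices/xattrop is taken from the directory listings quoted in this thread, not from any Gluster documentation.

```shell
#!/bin/sh
# Sketch only: snapshot_xattrop is a hypothetical helper name, and the
# .glusterfs/indices/xattrop layout is assumed from the listings in this thread.

# Print a datestamped entry count for one brick's xattrop index directory,
# followed by the entry names, so that persistent entries can be spotted
# by comparing successive snapshots.
snapshot_xattrop() {
    brick="$1"
    dir="$brick/.glusterfs/indices/xattrop"
    stamp=$(date '+%Y-%m-%d %H:%M:%S')

    if [ ! -d "$dir" ]; then
        echo "$stamp $dir: missing" >&2
        return 1
    fi

    # Same idea as "ls <dir> | wc -l"; tr strips the padding some wc builds add.
    count=$(ls -A "$dir" | wc -l | tr -d ' ')
    echo "$stamp $dir: $count entries"
    ls -A "$dir"
}
```

It would be called once per brick path on each server, for example `snapshot_xattrop /data/brick01a/homegfs >> xattrop.log`, from cron or a sleep loop. The count includes every file in the directory, which matches the plain "ls | wc -l" request, so it is a rough indicator and not an exact number of pending heals.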