> I am having issues with 3.6.6 where the load will spike up to 800% for
> one of the glusterfsd processes and the users can no longer access the
> system. If I reboot the node, the heal will finish normally after a
> few minutes and the system will be responsive, but a few hours later
> the issue will start again. It looks like it is hanging in a heal and
> spinning up the load on one of the bricks. The heal gets stuck, says
> it is crawling, and never returns. After a few minutes of the heal
> saying it is crawling, the load spikes and the mounts become
> unresponsive.
>
> Any suggestions on how to fix this? It has us stopped cold, as the
> users can no longer access the systems when the load spikes... Logs
> attached.
>
> System setup info is:
>
> [root@gfs01a ~]# gluster volume info homegfs
>
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 42
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: off
> changelog.fsync-interval: 3
> changelog.rollover-time: 15
> server.manage-gids: on
> diagnostics.client-log-level: WARNING
>
> [root@gfs01a ~]# rpm -qa | grep gluster
> gluster-nagios-common-0.1.1-0.el6.noarch
> glusterfs-fuse-3.6.6-1.el6.x86_64
> glusterfs-debuginfo-3.6.6-1.el6.x86_64
> glusterfs-libs-3.6.6-1.el6.x86_64
> glusterfs-geo-replication-3.6.6-1.el6.x86_64
> glusterfs-api-3.6.6-1.el6.x86_64
> glusterfs-devel-3.6.6-1.el6.x86_64
> glusterfs-api-devel-3.6.6-1.el6.x86_64
> glusterfs-3.6.6-1.el6.x86_64
> glusterfs-cli-3.6.6-1.el6.x86_64
> glusterfs-rdma-3.6.6-1.el6.x86_64
> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> glusterfs-server-3.6.6-1.el6.x86_64
> glusterfs-extra-xlators-3.6.6-1.el6.x86_64

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/11cdb723/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glusterfs-log.tgz
Type: application/x-compressed
Size: 6004421 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/11cdb723/attachment-0001.bin>
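For anyone chasing the same symptom, a few standard commands can narrow down which brick process is spinning and whether the self-heal crawl is the culprit. This is a suggested diagnostic sketch, not something from the original report; `homegfs` is the volume name from the setup above, and the statedump path may differ depending on how glusterd was built/configured:

```shell
# List files with pending heals, per brick (run on any server in the pool).
gluster volume heal homegfs info

# Find the glusterfsd process pinning the CPU; its command line includes a
# --brick-name argument identifying which brick it serves.
top -b -n 1 | head -20
ps -ef | grep [g]lusterfsd

# Take a statedump of the brick processes to capture what the busy one is
# doing; dumps typically land under /var/run/gluster (path is configurable
# via server.statedump-path).
gluster volume statedump homegfs
```

Comparing `heal info` output before and after the load spike should show whether the same files stay stuck in the heal queue, which is useful detail to include alongside the logs.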
resending with parsed logs...

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/34032f34/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glusterfs-log.tgz
Type: application/x-compressed
Size: 880609 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/34032f34/attachment-0001.bin>