> I am having issues with 3.6.6 where the load will spike up to 800% for
> one of the glusterfsd processes and the users can no longer access the
> system. If I reboot the node, the heal will finish normally after a
> few minutes and the system will be responsive, but a few hours later
> the issue will start again. It looks like it is hanging in a heal and
> spinning up the load on one of the bricks. The heal gets stuck, says
> it is crawling, and never returns. After a few minutes of the heal
> saying it is crawling, the load spikes and the mounts become
> unresponsive.
>
> Any suggestions on how to fix this? It has us stopped cold, as the
> users can no longer access the systems when the load spikes... Logs
> attached.
>
> System setup info is:
>
> [root@gfs01a ~]# gluster volume info homegfs
>
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 42
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: off
> changelog.fsync-interval: 3
> changelog.rollover-time: 15
> server.manage-gids: on
> diagnostics.client-log-level: WARNING
>
> [root@gfs01a ~]# rpm -qa | grep gluster
> gluster-nagios-common-0.1.1-0.el6.noarch
> glusterfs-fuse-3.6.6-1.el6.x86_64
> glusterfs-debuginfo-3.6.6-1.el6.x86_64
> glusterfs-libs-3.6.6-1.el6.x86_64
> glusterfs-geo-replication-3.6.6-1.el6.x86_64
> glusterfs-api-3.6.6-1.el6.x86_64
> glusterfs-devel-3.6.6-1.el6.x86_64
> glusterfs-api-devel-3.6.6-1.el6.x86_64
> glusterfs-3.6.6-1.el6.x86_64
> glusterfs-cli-3.6.6-1.el6.x86_64
> glusterfs-rdma-3.6.6-1.el6.x86_64
> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> glusterfs-server-3.6.6-1.el6.x86_64
> glusterfs-extra-xlators-3.6.6-1.el6.x86_64

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/11cdb723/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glusterfs-log.tgz
Type: application/x-compressed
Size: 6004421 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/11cdb723/attachment-0001.bin>
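For anyone chasing the same symptom, a few standard commands can narrow down which brick process is spinning and whether the self-heal crawl is the culprit. This is a suggested diagnostic sketch, not something from the original report; `homegfs` is the volume name from the setup above, and the statedump path may differ depending on how glusterd was built/configured:

```shell
# List files with pending heals, per brick (run on any server in the pool).
gluster volume heal homegfs info

# Find the glusterfsd process pinning the CPU; its command line includes a
# --brick-name argument identifying which brick it serves.
top -b -n 1 | head -20
ps -ef | grep [g]lusterfsd

# Take a statedump of the brick processes to capture what the busy one is
# doing; dumps typically land under /var/run/gluster (path is configurable
# via server.statedump-path).
gluster volume statedump homegfs
```

Comparing `heal info` output before and after the load spike should show whether the same files stay stuck in the heal queue, which is useful detail to include alongside the logs.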
resending with parsed logs...

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/34032f34/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glusterfs-log.tgz
Type: application/x-compressed
Size: 880609 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160120/34032f34/attachment-0001.bin>