thr3ads.net - Gluster users - [Gluster-users] Mount sometimes stops responding during server's MD RAID check sync


Jan Wrona

2017-May-16 14:13 UTC

[Gluster-users] Mount sometimes stops responding during server's MD RAID check sync_action

Hi,

I have three servers in a linked-list topology [1], running GlusterFS
3.8.10 on CentOS 7. Each server has two bricks, both on the same XFS
filesystem. The XFS filesystem spans the whole MD RAID device:
md5 : active raid5 sdj1[6] sdh1[8] sde1[2] sdg1[9] sdd1[1] sdi1[5] sdf1[3] sdc1[0]
      6836411904 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 2/8 pages [8KB], 65536KB chunk

Everything works fine until one of the RAID devices starts its regular
check. During the check, the client's mount sometimes stops responding
completely. I'm mounting via Pacemaker's Filesystem OCF RA [2] with
OCF_CHECK_LEVEL=20, which basically tries to write a small status file
to the filesystem every 2 minutes to see if it's OK. But even this small
write sometimes times out (after 2 minutes) during the check. Pacemaker
then remounts the Gluster volume and everything goes back to normal.
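
For illustration, a write check of this kind can be approximated
outside Pacemaker with the sketch below. It is not the resource agent's
code: the function names, the status file name (.probe_status), and the
default 120-second timeout are assumptions chosen to mirror the
description above.

```python
import os
import threading


def write_status_file(directory, name=".probe_status"):
    """Write a small file into `directory`, fsync it, and return its path."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"ok\n")
        os.fsync(fd)  # push the write through to the backing store
    finally:
        os.close(fd)
    return path


def probe_mount(directory, timeout_s=120.0):
    """Return True if the status write completed within `timeout_s` seconds."""
    result = {}

    def worker():
        try:
            write_status_file(directory)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # A daemon thread is used so that a write stuck in a blocked syscall
    # does not keep the interpreter from exiting.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return result.get("ok", False)
```

Running probe_mount() against the Gluster mount point and, at the same
time, against a directory on the brick's XFS filesystem would give a
side-by-side view of which layer stalls.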

I understand that the RAID check consumes a lot of I/O bandwidth, but
the underlying XFS remains responsive (it is slower, of course, but
nowhere near as slow as Gluster). The check intervals on the servers do
not overlap. I've even decreased /proc/sys/dev/raid/speed_limit_max from
the default 200 MB/s to 50 MB/s, but it helped only a little: the mount
still tends to freeze for a few seconds during the check.
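
As a sketch of the throttling step: the kernel expects the value in this
file in KB/s, so 50 MB/s corresponds to roughly 50000. The helper names
below are hypothetical, writing to the real file requires root, and the
path is a parameter so the helpers can be tried against an ordinary file
first.

```python
from pathlib import Path

# System-wide upper bound for MD resync/check speed, in KB/s.
SPEED_LIMIT_MAX = Path("/proc/sys/dev/raid/speed_limit_max")


def mb_s_to_kb_s(mb_s):
    """Convert MB/s to the KB/s unit used by the md speed limits."""
    return int(mb_s * 1000)


def read_limit_kb_s(path=SPEED_LIMIT_MAX):
    """Return the current limit in KB/s."""
    return int(Path(path).read_text().strip())


def write_limit_kb_s(value_kb_s, path=SPEED_LIMIT_MAX):
    """Set a new limit in KB/s; rejects non-positive values."""
    if value_kb_s <= 0:
        raise ValueError("speed limit must be a positive number of KB/s")
    Path(path).write_text(f"{value_kb_s}\n")
```

For example, write_limit_kb_s(mb_s_to_kb_s(50)) would apply the 50 MB/s
cap described above until the next reboot or the next write to the file.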

What are your suggestions to solve this issue?

Regards,
Jan Wrona

[1] https://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/
[2] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem


Pranith Kumar Karampuri

2017-May-18 04:07 UTC


The next time this happens, could you collect statedumps of the brick
processes where this activity is going on, at intervals of 10 seconds?

For how to take a statedump, see:
https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/
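
A minimal sketch of collecting the dumps at that interval, assuming the
"gluster volume statedump <VOLNAME>" command described on that page is
run as root on a server in the pool. The volume name, the number of
dumps, and the function names are placeholders, not values from this
thread.

```python
import subprocess
import time


def statedump_command(volume):
    """Build the CLI call that asks the volume's bricks to dump their state."""
    return ["gluster", "volume", "statedump", volume]


def collect_statedumps(volume, count=6, interval_s=10.0,
                       run=subprocess.run, sleep=time.sleep):
    """Trigger `count` statedumps, `interval_s` seconds apart.

    Returns the exit code of each invocation. `run` and `sleep` are
    injectable so the loop can be exercised without a Gluster install.
    """
    exit_codes = []
    for i in range(count):
        completed = run(statedump_command(volume), check=False)
        exit_codes.append(completed.returncode)
        if i < count - 1:
            sleep(interval_s)  # no wait needed after the last dump
    return exit_codes
```

Calling collect_statedumps("myvol") while the mount is hung would
produce six dumps over roughly a minute; where the dump files land
depends on the statedump path configured on the servers.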



-- 
Pranith
