Ernie Dunbar
2016-Apr-14 21:51 UTC
[Gluster-users] Gluster volume heal statistics aren't changing.
Hi everyone. A few days ago, I added another Gluster server to our cluster to prevent split-brains. I told the new server to do a self-heal operation, then sat back and waited while the performance of the cluster dropped dramatically and our customers all lost patience with us over the course of several days. Now I see that the disk on the new node has partly filled, but the self-heal process appears to have stalled. This is what I see when I run the "volume heal statistics heal-count" command:

root@nfs3:/home/ernied# date
Thu Apr 14 13:14:00 PDT 2016
root@nfs3:/home/ernied# gluster volume heal gv2 statistics heal-count
Gathering count of entries to be healed on volume gv2 has been successful

Brick nfs1:/brick1/gv2
Number of entries: 475

Brick nfs2:/brick1/gv2
Number of entries: 190

Brick nfs3:/brick1/gv2
Number of entries: 36

root@nfs3:/home/ernied# date
Thu Apr 14 14:35:00 PDT 2016
root@nfs3:/home/ernied# gluster volume heal gv2 statistics heal-count
Gathering count of entries to be healed on volume gv2 has been successful

Brick nfs1:/brick1/gv2
Number of entries: 475

Brick nfs2:/brick1/gv2
Number of entries: 190

Brick nfs3:/brick1/gv2
Number of entries: 36

After an hour and 20 minutes, I see zero progress. How do I give this thing a kick in the pants to get it moving?

Also, after reading a bit about Gluster tuning, I suspect I may have made a mistake in creating the bricks. I've read that we should have multiple pairs of bricks for faster access, but we've only got one brick replicated over three servers. Or maybe that's three bricks all named the same thing; I'm not sure.
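In case it helps, this is roughly how I've been deciding that the heal is stalled: take two heal-count snapshots some time apart and compare the per-brick entry counts. The script below is just a sketch; the sample data in the here-documents is the two reports above, not live output.

```shell
#!/bin/sh
# Sketch: detect a stalled self-heal by comparing two snapshots of
# "gluster volume heal gv2 statistics heal-count" taken ~80 minutes apart.
# The here-doc contents are the two actual reports captured above.

snap() {
  # Pull just the "Number of entries" values out of a heal-count report.
  awk '/Number of entries:/ {print $NF}'
}

# Snapshot taken at 13:14 PDT
counts1=$(snap <<'EOF'
Brick nfs1:/brick1/gv2
Number of entries: 475
Brick nfs2:/brick1/gv2
Number of entries: 190
Brick nfs3:/brick1/gv2
Number of entries: 36
EOF
)

# Snapshot taken at 14:35 PDT
counts2=$(snap <<'EOF'
Brick nfs1:/brick1/gv2
Number of entries: 475
Brick nfs2:/brick1/gv2
Number of entries: 190
Brick nfs3:/brick1/gv2
Number of entries: 36
EOF
)

# Identical counts after more than an hour suggest no healing is happening.
if [ "$counts1" = "$counts2" ]; then
  echo "heal appears stalled"
else
  echo "heal is progressing"
fi
# → prints "heal appears stalled" for the data above
```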
Here's what the "volume info" command shows:

root@nfs1:/home/ernied# gluster volume info

Volume Name: gv2
Type: Replicate
Volume ID: 3969e9cc-a2bf-4819-8c02-bf51ec0c905f
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: nfs1:/brick1/gv2
Brick2: nfs2:/brick1/gv2
Brick3: nfs3:/brick1/gv2
Options Reconfigured:
cluster.server-quorum-type: none
cluster.server-quorum-ratio: 51

We currently have about 618 GB of data shared across three 6 TB RAID arrays. The data is nearly all e-mail, so there are lots of small files, and IMAP does a lot of random read/write operations. Customers are not pleased with the speed of our webmail right now. Would creating a larger number of smaller bricks speed up our backend performance? Is there a way to do that non-destructively?
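For what it's worth, what I imagine "more, smaller bricks" would look like is growing this 1 x 3 replicate volume into a 2 x 3 distributed-replicate volume by adding a second replica set. Something like the following, I think, though I haven't run it and the /brick2 paths are made up:

```shell
# Hypothetical: add a second set of three bricks (keeping replica 3),
# which would turn the volume into a 2 x 3 distributed-replicate layout.
gluster volume add-brick gv2 replica 3 \
    nfs1:/brick2/gv2 nfs2:/brick2/gv2 nfs3:/brick2/gv2

# Then spread the existing data across the old and new bricks.
gluster volume rebalance gv2 start
```

Is that the right general shape, and is the rebalance safe to run on a live volume with IMAP clients hammering it?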