Hi Patrick,

I guess you can collect some data via the 'gluster volume profile' command. At least it should show any issues from a performance point of view:

gluster volume profile <volume> start
(do an 'ls' on the mounted volume to generate some load)
gluster volume profile <volume> info
gluster volume profile <volume> stop

Also, can you identify the top 3 errors seen in the logs? If you manage to fix them (with the help of the community) one by one, you might restore your full functionality.

By the way, do you have the option to archive some of the data and thus reduce the amount stored - which obviously will improve ZFS performance.

Best Regards,
Strahil Nikolov

On Apr 21, 2019 10:50, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>
> Hi Darrell,
>
> Thanks again for your advice. I've left it for a while, but unfortunately it's still just as slow and is now causing more problems for our operations. I will need to take some steps to at least bring performance back to normal while continuing to investigate the issue longer term. I can definitely see one node with heavier CPU than the other, almost double, which I am OK with, but I think the heal process is going to take forever. Checking "gluster volume heal info" shows thousands and thousands of files which may need healing - I have no idea how many in total, as the command is still running after hours - so I am not sure what has gone so wrong to cause this.
>
> I've checked cluster.op-version and cluster.max-op-version, and it looks like I'm on the latest version there.
>
> I have no idea how long the healing is going to take on this cluster - we have around 560TB of data on here - but I don't think I can wait that long to try and restore performance to normal.
>
> Can anyone think of anything else I can try in the meantime to work out what's causing the extreme latency?
>
> I've been going through the cluster client logs of some of our VMs, and on some of our FTP servers I found this in the cluster mount log, but I am not seeing it on any of our other servers, just our FTP servers.
>
> [2019-04-21 07:16:19.925388] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
> [2019-04-21 07:19:43.413834] W [MSGID: 114031] [client-rpc-fops.c:2203:client3_3_setattr_cbk] 0-gvAA01-client-19: remote operation failed [No such file or directory]
> [2019-04-21 07:19:43.414153] W [MSGID: 114031] [client-rpc-fops.c:2203:client3_3_setattr_cbk] 0-gvAA01-client-20: remote operation failed [No such file or directory]
> [2019-04-21 07:23:33.154717] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
> [2019-04-21 07:33:24.943913] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
>
> Any ideas what this could mean? I am basically just grasping at straws here.
>
> I am going to hold off on the version upgrade until I know there are no files which need healing, which could be a while. From some reading I've done, there shouldn't be any issues with this as both versions are on v3.12.x.
>
> I've freed up a small amount of space, but I still need to work on this further.
>
> I've read of a command, "find .glusterfs -type f -links -2 -exec rm {} \;", which could be run on each brick and would potentially clean up any files which were deleted straight from the bricks but not via the client. I have a feeling this could help me free up about 5-10TB per brick, from what I've been told about the history of this cluster. Can anyone confirm if this is actually safe to run?
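>
> To be safe, my current thinking is to dry-run it first and only delete after reviewing the list. A rough sketch, assuming a brick mounted at /brick1 (hypothetical path - I'd substitute each real brick mount point):
>
> cd /brick1
> # list candidate orphans (gfid files with link count 1) without deleting anything
> find .glusterfs -type f -links -2 > /tmp/orphan-gfids.txt
> wc -l /tmp/orphan-gfids.txt
> # only once the list looks sane:
> # find .glusterfs -type f -links -2 -exec rm {} \;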
>
> At this stage, I'm open to any suggestions as to how to proceed. Thanks again for any advice.
>
> Cheers,
>
> - Patrick
>
> On Sun, Apr 21, 2019 at 1:22 AM Darrell Budic <budic at onholyground.com> wrote:
>>
>> Patrick,
>>
>> Sounds like progress. Be aware that gluster is expected to max out the CPUs on at least one of your servers while healing. This is normal and won't adversely affect overall performance (any more than having bricks in need of healing does, at any rate) unless you're overdoing it; shd threads <= 4 should not do that on your hardware. Other tunings may have also increased overall performance, so you may see higher CPU than previously anyway. I'd recommend upping those thread counts and letting it heal as fast as possible, especially if these are dedicated Gluster storage servers (i.e. not also running VMs, etc.). You should see "normal" CPU use once heals are completed. I see ~15-30% overall normally, 95-98% while healing (x my 20 cores). It's also likely to be different between your servers: in a pure replica, one tends to max out and one tends to run a little higher than normal; in a distributed-replica, I'd expect more than one to run harder while healing.
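>>
>> For reference, bumping the shd thread count is just a volume option - a sketch only, with gvAA01 taken from your logs, and the values are something you'd want to tune for your hardware:
>>
>> # check the current values first
>> gluster volume get gvAA01 cluster.shd-max-threads
>> gluster volume get gvAA01 cluster.shd-wait-qlength
>> # then raise them, e.g.:
>> gluster volume set gvAA01 cluster.shd-max-threads 4
>> gluster volume set gvAA01 cluster.shd-wait-qlength 2048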