Hi Patrick,

I guess you can collect some data via the 'gluster volume profile' command. At least it should show any issues from a performance point of view:

gluster volume profile <volume> start
(do an 'ls' on the mounted volume to generate some load)
gluster volume profile <volume> info
gluster volume profile <volume> stop

Also, can you identify the top 3 errors seen in the logs? If you manage to fix them (with the help of the community) one by one, you might restore your full functionality.

By the way, do you have the option to archive some of the data and thus reduce the amount stored - which obviously will improve ZFS performance.

Best Regards,
Strahil Nikolov

On Apr 21, 2019 10:50, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>
> Hi Darrell,
>
> Thanks again for your advice. I've left it for a while, but unfortunately it's still just as slow and is now causing more problems for our operations. I will need to take some steps to at least bring performance back to normal while continuing to investigate the issue longer term. I can definitely see one node with heavier CPU than the other, almost double, which I am OK with, but I think the heal process is going to take forever. Checking "gluster volume heal info" shows thousands and thousands of files which may need healing - I have no idea how many in total, as the command is still running after hours - so I am not sure what has gone so wrong to cause this.
>
> I've checked cluster.op-version and cluster.max-op-version, and it looks like I'm on the latest version there.
>
> I have no idea how long the healing is going to take on this cluster - we have around 560TB of data on here - but I don't think I can wait that long to try and restore performance to normal.
>
> Can anyone think of anything else I can try in the meantime to work out what's causing the extreme latency?
>
> I've been going through the cluster client logs of some of our VMs, and on some of our FTP servers I found this in the cluster mount log, but I am not seeing it on any of our other servers, just our FTP servers.
>
> [2019-04-21 07:16:19.925388] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
> [2019-04-21 07:19:43.413834] W [MSGID: 114031] [client-rpc-fops.c:2203:client3_3_setattr_cbk] 0-gvAA01-client-19: remote operation failed [No such file or directory]
> [2019-04-21 07:19:43.414153] W [MSGID: 114031] [client-rpc-fops.c:2203:client3_3_setattr_cbk] 0-gvAA01-client-20: remote operation failed [No such file or directory]
> [2019-04-21 07:23:33.154717] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
> [2019-04-21 07:33:24.943913] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-gvAA01-dht: dict is null
>
> Any ideas what this could mean? I am basically just grasping at straws here.
>
> I am going to hold off on the version upgrade until I know there are no files which need healing, which could be a while. From some reading I've done, there shouldn't be any issues with this as both versions are on v3.12.x.
>
> I've freed up a small amount of space, but I still need to work on this further.
>
> I've read of a command, "find .glusterfs -type f -links -2 -exec rm {} \;", which could be run on each brick and would potentially clean up any files which were deleted straight from the bricks but not via the client. I have a feeling this could help me free up about 5-10TB per brick, from what I've been told about the history of this cluster. Can anyone confirm if this is actually safe to run?
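>
> To be safe, my current thinking is to dry-run it first and only delete after reviewing the list. A rough sketch, assuming a brick mounted at /brick1 (hypothetical path - I'd substitute each real brick mount point):
>
> cd /brick1
> # list candidate orphans (gfid files with link count 1) without deleting anything
> find .glusterfs -type f -links -2 > /tmp/orphan-gfids.txt
> wc -l /tmp/orphan-gfids.txt
> # only once the list looks sane:
> # find .glusterfs -type f -links -2 -exec rm {} \;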
>
> At this stage, I'm open to any suggestions as to how to proceed. Thanks again for any advice.
>
> Cheers,
>
> - Patrick
>
> On Sun, Apr 21, 2019 at 1:22 AM Darrell Budic <budic at onholyground.com> wrote:
>>
>> Patrick,
>>
>> Sounds like progress. Be aware that gluster is expected to max out the CPUs on at least one of your servers while healing. This is normal and won't adversely affect overall performance (any more than having bricks in need of healing does, at any rate) unless you're overdoing it; shd threads <= 4 should not do that on your hardware. Other tunings may have also increased overall performance, so you may see higher CPU than previously anyway. I'd recommend upping those thread counts and letting it heal as fast as possible, especially if these are dedicated Gluster storage servers (i.e. not also running VMs, etc.). You should see "normal" CPU use once heals are completed. I see ~15-30% overall normally, 95-98% while healing (x my 20 cores). It's also likely to be different between your servers: in a pure replica, one tends to max out and one tends to run a little higher than normal; in a distributed-replica, I'd expect more than one to run harder while healing.
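>>
>> For reference, bumping the shd thread count is just a volume option - a sketch only, with gvAA01 taken from your logs, and the values are something you'd want to tune for your hardware:
>>
>> # check the current values first
>> gluster volume get gvAA01 cluster.shd-max-threads
>> gluster volume get gvAA01 cluster.shd-wait-qlength
>> # then raise them, e.g.:
>> gluster volume set gvAA01 cluster.shd-max-threads 4
>> gluster volume set gvAA01 cluster.shd-wait-qlength 2048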