Thanks again,
I have tried to run a find over the cluster to try and trigger
self-healing, but it's very slow so I don't have it running right now.
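For anyone following along, the crawl is essentially a stat walk like this (a sketch; /cluster is a placeholder for the FUSE mount point):

```shell
# crawl MOUNT: stat every regular file under MOUNT. Each stat forces a
# lookup through the FUSE client, which queues self-heal on replicas
# with pending heal xattrs; prints how many files were touched.
crawl() {
    find "$1" -type f -exec stat -c '%n' {} + | wc -l
}
# usage: crawl /cluster
```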
If I check the same "ls /brick/folder" on all bricks, it takes less than
0.01 sec, so I don't think any individual brick is causing the problem;
performance on each brick seems to be normal.
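For reference, the per-brick timing check is just a small helper run on each brick host (a sketch; the brick path is a placeholder):

```shell
# time_ls DIR: print roughly how many milliseconds a plain 'ls DIR'
# takes, straight off the brick filesystem (bypassing gluster), to rule
# out a slow backend disk.
time_ls() {
    start=$(date +%s%N)                  # nanoseconds since epoch (GNU date)
    ls "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))  # elapsed milliseconds
}
# usage, on each brick host: time_ls /brick/folder
```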
I think the issue is somewhere in gluster's internal communication, as I
believe FUSE-mounted clients try to communicate with all bricks.
Unfortunately, I am not sure how to confirm this or narrow it down.
I'm really struggling with this one now; it's starting to significantly
impact our operations. I'm not sure what else I can try, so I'd appreciate
any suggestions.
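One way to at least confirm the client really has a connection to every brick (my own guess at a check, not an official procedure) is to compare the client's established TCP connections against the brick ports listed by 'gluster volume status gvAA01'; brick ports are typically in the 49152+ range:

```shell
# count_brick_conns: read 'ss -tn' output on stdin and count established
# connections whose peer port is in the usual gluster brick range
# (49152-49251; adjust to match what 'gluster volume status' shows).
count_brick_conns() {
    awk '$1 == "ESTAB" {
             n = split($5, p, ":")          # peer address is field 5
             port = p[n] + 0
             if (port >= 49152 && port <= 49251) c++
         }
         END { print c + 0 }'
}
# usage, on the client: ss -tn | count_brick_conns
# (the count should match the number of bricks in the volume)
```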
Thank you,
- Patrick
On Sun, Apr 21, 2019 at 11:50 PM Strahil <hunter86_bg at yahoo.com> wrote:
> Usually when this happens I run 'find /fuse/mount/point -exec stat {} \;'
> from a client (using gluster with oVirt).
> Yet, my scale is multiple times smaller and I don't know how this will
> affect you (except it will trigger a heal).
>
> So the round-robin DNS clarifies the mystery. In that case, maybe the
> FUSE client is not the problem. Still, it is worth trying a VM with the
> new gluster version to mount the cluster.
>
> From the profile (I took a short glance over it from my phone), not all
> bricks are spending much of their time in LOOKUP.
> Maybe your data is not evenly distributed? Is that possible?
> Sadly, you can't rebalance until all those pending heals complete. (Maybe
> I'm wrong.)
>
> Have you checked the speed of 'ls /my/brick/subdir1/' on each brick?
>
> Sadly, I'm just a gluster user, so take everything with a grain of salt.
>
> Best Regards,
> Strahil Nikolov
> On Apr 21, 2019 18:03, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>
> I just tried to check my "gluster volume heal gvAA01 statistics" and it
> doesn't seem like a full heal was still in progress, just an index heal. I
> have started the full heal again and am trying to monitor it with "gluster
> volume heal gvAA01 info", which just shows me thousands of gfid file
> identifiers scrolling past.
> What is the best way to check the status of a heal and track the files
> healed and progress to completion?
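In case it helps, one way to get a single number out of that scroll is to sum the per-brick "Number of entries" lines; re-running it over time shows whether the backlog is shrinking ('gluster volume heal gvAA01 statistics heal-count' should print the same per-brick counts without listing every gfid):

```shell
# heal_pending: read 'gluster volume heal <vol> info' output on stdin
# and sum the per-brick "Number of entries: N" lines into one total.
heal_pending() {
    awk -F': ' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}
# usage: gluster volume heal gvAA01 info | heal_pending
```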
>
> Thank you,
> - Patrick
>
> On Sun, Apr 21, 2019 at 10:28 PM Patrick Rennie <patrickmrennie at gmail.com>
> wrote:
>
> I think I just worked out why NFS lookups are sometimes slow and sometimes
> fast: the hostname uses round-robin DNS. If I change to a specific host,
> 01-B, it's always quick, and if I change to the other brick host, 02-B,
> it's always slow.
> Maybe that will help to narrow this down?
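If the round-robin record turns out to be the trigger, pinning the FUSE mounts to the fast server could be a stopgap; mount.glusterfs supports a backup volfile server option for failover (hostnames here are placeholders, and note this only pins where the volume file is fetched from; actual I/O still goes to all bricks):

```
# fstab entry pinned to the fast node, with the slow node as backup:
01-B:/gvAA01 /mountpoint glusterfs defaults,_netdev,backup-volfile-servers=02-B 0 0
```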
>
> On Sun, Apr 21, 2019 at 10:24 PM Patrick Rennie <patrickmrennie at gmail.com>
> wrote:
>
> Hi Strahil,
>
> Thank you for your reply and your suggestions. I'm not sure which logs
> would be most relevant for diagnosing this issue: the brick logs, the
> client mount logs, the shd logs, or something else? I have posted a few
> errors that I have seen repeated a few times already, and I will continue
> to post anything further that I see.
> I am working on migrating data to some new storage, so this will slowly
> free up space, although this is a production cluster and new data is being
> uploaded every day, sometimes faster than I can migrate it off. I have
> several other similar clusters and none of them have the same problem. One
> of the others is actually at 98-99% right now (big problem, I know) but
> still performs perfectly fine compared to this cluster, so I am not sure
> low space is the root cause here.
>
> I currently have 13 VMs accessing this cluster. I have checked each one,
> and all of them use one of the two options below to mount the cluster in
> fstab:
>
> HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable,use-readdirp=no 0 0
> HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable 0 0
>
> I also have a few other VMs which use NFS to access the cluster, and these
> machines appear to be significantly quicker. Initially I get a similar
> delay with NFS, but if I cancel the first "ls" and try it again I get < 1
> sec lookups; the same listing can take over 10 minutes via the FUSE/gluster
> client, and the same trick of cancelling and trying again doesn't work for
> FUSE/gluster. Sometimes the NFS queries have no delay at all, so this is a
> bit strange to me.
>
> HOSTNAME:/gvAA01 /mountpoint/ nfs defaults,_netdev,vers=3,async,noatime 0 0
>
> Example:
> user at VM:~$ time ls /cluster/folder
> ^C
>
> real 9m49.383s
> user 0m0.001s
> sys 0m0.010s
>
> user at VM:~$ time ls /cluster/folder
> <results>
>
> real 0m0.069s
> user 0m0.001s
> sys 0m0.007s
>
> ---
>
> I have checked the profiling as you suggested; I let it run for around a
> minute, then cancelled it and saved the profile info.
>
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 start
> Starting volume profile on gvAA01 has been successful
> root at HOSTNAME:/var/log/glusterfs# time ls /cluster/folder
> ^C
>
> real 1m1.660s
> user 0m0.000s
> sys 0m0.002s
>
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 info > ~/profile.txt
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 stop
>
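For comparing bricks in the profile output, the LOOKUP rows can be pulled out per brick with a quick filter (assuming the usual 'Brick: host:/path' section headers, with the fop name as the last column of each row):

```shell
# lookup_rows: read 'gluster volume profile <vol> info' output on stdin
# and print each brick's LOOKUP latency rows, to spot bricks spending a
# disproportionate share of their time in LOOKUP.
lookup_rows() {
    awk '/^Brick:/ { brick = $2 }
         $NF == "LOOKUP" { print brick, $0 }'
}
# usage: lookup_rows < ~/profile.txt
```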
> I will attach the results to this email as it's o
>
>