Thanks again,
I have tried to run a find over the cluster to try and trigger
self-healing, but it's very slow so I don't have it running right now.
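For anyone following along, the crawl is essentially a stat walk like this (a sketch; /cluster is a placeholder for the FUSE mount point):

```shell
# crawl MOUNT: stat every regular file under MOUNT. Each stat forces a
# lookup through the FUSE client, which queues self-heal on replicas
# with pending heal xattrs; prints how many files were touched.
crawl() {
    find "$1" -type f -exec stat -c '%n' {} + | wc -l
}
# usage: crawl /cluster
```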
If I check the same "ls /brick/folder" on all bricks, it takes less than
0.01 sec, so I don't think any individual brick is causing the problem;
performance on each brick seems to be normal.
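For reference, the per-brick timing check is just a small helper run on each brick host (a sketch; the brick path is a placeholder):

```shell
# time_ls DIR: print roughly how many milliseconds a plain 'ls DIR'
# takes, straight off the brick filesystem (bypassing gluster), to rule
# out a slow backend disk.
time_ls() {
    start=$(date +%s%N)                  # nanoseconds since epoch (GNU date)
    ls "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))  # elapsed milliseconds
}
# usage, on each brick host: time_ls /brick/folder
```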
I think the issue is somewhere in gluster's internal communication, as I
believe FUSE-mounted clients try to communicate with all bricks.
Unfortunately, I am not sure how to confirm this or narrow it down.
I'm really struggling with this one now; it's starting to significantly
impact our operations. I'm not sure what else I can try, so I'd appreciate
any suggestions.
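One way to at least confirm the client really has a connection to every brick (my own guess at a check, not an official procedure) is to compare the client's established TCP connections against the brick ports listed by 'gluster volume status gvAA01'; brick ports are typically in the 49152+ range:

```shell
# count_brick_conns: read 'ss -tn' output on stdin and count established
# connections whose peer port is in the usual gluster brick range
# (49152-49251; adjust to match what 'gluster volume status' shows).
count_brick_conns() {
    awk '$1 == "ESTAB" {
             n = split($5, p, ":")          # peer address is field 5
             port = p[n] + 0
             if (port >= 49152 && port <= 49251) c++
         }
         END { print c + 0 }'
}
# usage, on the client: ss -tn | count_brick_conns
# (the count should match the number of bricks in the volume)
```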
Thank you,
- Patrick
On Sun, Apr 21, 2019 at 11:50 PM Strahil <hunter86_bg at yahoo.com> wrote:
> Usually when this happens I run 'find /fuse/mount/point -exec stat {} \;'
> from a client (using gluster with oVirt).
> Yet, my scale is multiple times smaller and I don't know how this will
> affect you (except it will trigger a heal).
>
> So the round-robin DNS clarifies the mystery. In that case, maybe the
> FUSE client is not the problem. Still, it is worth trying a VM with the
> new gluster version to mount the cluster.
>
> From the profile (I took a short glance over it from my phone), not all
> bricks are spending much of their time in LOOKUP.
> Maybe your data is not evenly distributed? Is that possible?
> Sadly, you can't rebalance until all those pending heals complete. (Maybe
> I'm wrong.)
>
> Have you checked the speed of 'ls /my/brick/subdir1/' on each brick?
>
> Sadly, I'm just a gluster user, so take everything with a grain of salt.
>
> Best Regards,
> Strahil Nikolov
> On Apr 21, 2019 18:03, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>
> I just tried to check my "gluster volume heal gvAA01 statistics" and it
> doesn't seem like a full heal was still in progress, just an index heal. I
> have started the full heal again and am trying to monitor it with "gluster
> volume heal gvAA01 info", which just shows me thousands of gfid file
> identifiers scrolling past.
> What is the best way to check the status of a heal and track the files
> healed and progress to completion?
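In case it helps, one way to get a single number out of that scroll is to sum the per-brick "Number of entries" lines; re-running it over time shows whether the backlog is shrinking ('gluster volume heal gvAA01 statistics heal-count' should print the same per-brick counts without listing every gfid):

```shell
# heal_pending: read 'gluster volume heal <vol> info' output on stdin
# and sum the per-brick "Number of entries: N" lines into one total.
heal_pending() {
    awk -F': ' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}
# usage: gluster volume heal gvAA01 info | heal_pending
```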
>
> Thank you,
> - Patrick
>
> On Sun, Apr 21, 2019 at 10:28 PM Patrick Rennie <patrickmrennie at gmail.com>
> wrote:
>
> I think I just worked out why NFS lookups are sometimes slow and sometimes
> fast: the hostname uses round-robin DNS. If I change to a specific host,
> 01-B, it's always quick, and if I change to the other brick host, 02-B,
> it's always slow.
> Maybe that will help to narrow this down?
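If the round-robin record turns out to be the trigger, pinning the FUSE mounts to the fast server could be a stopgap; mount.glusterfs supports a backup volfile server option for failover (hostnames here are placeholders, and note this only pins where the volume file is fetched from; actual I/O still goes to all bricks):

```
# fstab entry pinned to the fast node, with the slow node as backup:
01-B:/gvAA01 /mountpoint glusterfs defaults,_netdev,backup-volfile-servers=02-B 0 0
```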
>
> On Sun, Apr 21, 2019 at 10:24 PM Patrick Rennie <patrickmrennie at gmail.com>
> wrote:
>
> Hi Strahil,
>
> Thank you for your reply and your suggestions. I'm not sure which logs
> would be most relevant for diagnosing this issue: the brick logs, the
> client mount logs, the shd logs, or something else? I have posted a few
> errors that I have seen repeated a few times already, and I will continue
> to post anything further that I see.
> I am working on migrating data to some new storage, so this will slowly
> free up space, although this is a production cluster and new data is being
> uploaded every day, sometimes faster than I can migrate it off. I have
> several other similar clusters and none of them have the same problem. One
> of the others is actually at 98-99% right now (big problem, I know) but
> still performs perfectly fine compared to this cluster, so I am not sure
> low space is the root cause here.
>
> I currently have 13 VMs accessing this cluster. I have checked each one,
> and all of them use one of the two options below to mount the cluster in
> fstab:
>
> HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable,use-readdirp=no 0 0
> HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable 0 0
>
> I also have a few other VMs which use NFS to access the cluster, and these
> machines appear to be significantly quicker. Initially I get a similar
> delay with NFS, but if I cancel the first "ls" and try it again I get < 1
> sec lookups; the same listing can take over 10 minutes via the FUSE/gluster
> client, and the same trick of cancelling and trying again doesn't work for
> FUSE/gluster. Sometimes the NFS queries have no delay at all, so this is a
> bit strange to me.
>
> HOSTNAME:/gvAA01 /mountpoint/ nfs defaults,_netdev,vers=3,async,noatime 0 0
>
> Example:
> user at VM:~$ time ls /cluster/folder
> ^C
>
> real 9m49.383s
> user 0m0.001s
> sys 0m0.010s
>
> user at VM:~$ time ls /cluster/folder
> <results>
>
> real 0m0.069s
> user 0m0.001s
> sys 0m0.007s
>
> ---
>
> I have checked the profiling as you suggested; I let it run for around a
> minute, then cancelled it and saved the profile info.
>
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 start
> Starting volume profile on gvAA01 has been successful
> root at HOSTNAME:/var/log/glusterfs# time ls /cluster/folder
> ^C
>
> real 1m1.660s
> user 0m0.000s
> sys 0m0.002s
>
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 info > ~/profile.txt
> root at HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 stop
>
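For comparing bricks in the profile output, the LOOKUP rows can be pulled out per brick with a quick filter (assuming the usual 'Brick: host:/path' section headers, with the fop name as the last column of each row):

```shell
# lookup_rows: read 'gluster volume profile <vol> info' output on stdin
# and print each brick's LOOKUP latency rows, to spot bricks spending a
# disproportionate share of their time in LOOKUP.
lookup_rows() {
    awk '/^Brick:/ { brick = $2 }
         $NF == "LOOKUP" { print brick, $0 }'
}
# usage: lookup_rows < ~/profile.txt
```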
> I will attach the results to this email as it's o
>
>