Patrick,
I would definitely upgrade your two nodes from 3.12.14 to 3.12.15. You also
mention ZFS, and the error you show makes me think you should check that you
have 'xattr=sa' and 'acltype=posixacl' set on your ZFS volumes.
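You can check both properties with zfs get and set them if they're missing - the
pool/dataset name below is just a placeholder for wherever your bricks live:

# zfs get xattr,acltype tank/brick1      (tank/brick1 is a placeholder)
# zfs set xattr=sa tank/brick1
# zfs set acltype=posixacl tank/brick1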
You also observed that your bricks are crossing the 95% full line; ZFS performance
degrades significantly the closer you get to full. In my experience this starts
somewhere between 10% and 5% free space remaining, so you're in that realm.
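A quick check on each node will show exactly how close each pool is (the pool
name here is just a placeholder):

# zpool list
# zfs list tank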
How's the free memory on the servers doing? Do you have your ZFS ARC cache
limited to something less than all the RAM? It shares pretty well, but I've
encountered situations where other things won't try to take RAM back properly
if they think it's in use, so ZFS never gets the opportunity to give it up.
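If it isn't capped yet, the zfs_arc_max module parameter is the knob; for example,
to cap it at 64GB (the value is only an example, size it for your RAM):

# echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max        (takes effect live)
# echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf   (persists across reboots)

You can watch the current cap and usage in /proc/spl/kstat/zfs/arcstats (the
c_max and size lines).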
Since your volume is a disperse-replica, you might try tuning
disperse.shd-max-threads; the default is 1, and I'd try it at 2, 4, or even more
if the CPUs are beefy enough. Setting server.event-threads to 4 and
client.event-threads to 8 has proven helpful in many cases. After you get
upgraded to 3.12.15, enabling performance.stat-prefetch may help as well. I
don't know if it matters, but I'd also recommend resetting
performance.least-prio-threads to the default of 1 (or try 2 or 4) and/or
setting performance.io-thread-count to 32 if those have beefy CPUs.
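For reference, those are all per-volume settings, along these lines (the values
are just the starting points above):

# gluster volume set gvAA01 server.event-threads 4
# gluster volume set gvAA01 client.event-threads 8
# gluster volume set gvAA01 performance.io-thread-count 32
# gluster volume set gvAA01 performance.least-prio-threads 1
# gluster volume set gvAA01 performance.stat-prefetch on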
Beyond those general ideas, more info about your hardware (CPU and RAM) and
workload (VMs, direct storage for web servers or enders, etc.) may net you some
more ideas. Then you're going to have to do more digging into the brick logs,
looking for errors and/or warnings, to see what's going on.
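The brick logs are normally under /var/log/glusterfs/bricks/ on each node, so
something like this will pull out the recent errors and warnings:

# grep -E "\] [EW] \[" /var/log/glusterfs/bricks/*.log | tail -200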
-Darrell
> On Apr 20, 2019, at 8:22 AM, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>
> Hello Gluster Users,
>
> I am hoping someone can help me resolve an ongoing issue I've been having; I'm new to mailing lists, so forgive me if I have gotten anything wrong.
> We have noticed our performance deteriorating over the last few weeks, easily measured by timing an ls on one of our top-level folders: it usually takes 2-5 seconds but now takes up to 20 minutes, which obviously renders our cluster basically unusable.
> This has been intermittent in the past but is now almost constant, and I am not sure how to work out the exact cause.
> We have noticed some errors in the brick logs, and we have found that if we kill the right brick process, performance instantly returns to normal. It is not always the same brick, but it indicates to me that something in the brick processes or background tasks may be causing extreme latency.
> Because we can fix it by killing the right brick process, I think a specific file, folder, or operation may be hanging and causing the increased latency, but I am not sure how to work out which.
> One last thing to add is that our bricks are getting quite full (~95%); we are trying to migrate data off to new storage, but that is going slowly and is not helped by this issue.
> I am currently trying to run a full heal, as there appear to be many files needing healing, and I have all brick processes running so they have an opportunity to heal, but this means performance is very poor.
> It currently takes 15-20 minutes to do an ls of one of our top-level folders, which contains just 60-80 other folders; this should take 2-5 seconds.
> This is all being checked via a FUSE mount locally on the storage node itself, but it is the same for other clients and VMs accessing the cluster.
> Initially it seemed our NFS mounts were not affected and operated at normal speed, but testing over the last day has shown that our NFS clients are also extremely slow, so it doesn't seem specific to FUSE as I first thought it might be.
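>
> For reference, the timing check is just this, run against the local FUSE mount (the mount point and folder name here are placeholders):
>
> # time ls /mnt/gvAA01/<top-level-folder>     # placeholder path - substitute the real mount and folder
>
> and gluster volume heal gvAA01 info lists the entries that still need healing.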
>
> I am not sure how to proceed from here; I am fairly new to Gluster, having inherited this setup from my predecessor, and I am trying to keep it going. I have included some info below to help with diagnosis; please let me know if any further info would be helpful. I would really appreciate any advice on what I could try to work out the cause. Thank you in advance for reading this, and for any suggestions you might be able to offer.
>
> - Patrick
>
> This is an example of the main error I see in our brick logs; there have been others, and I can post them when I see them again too:
> [2019-04-20 04:54:43.055680] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on /brick1/<filename> library: system.posix_acl_default [Operation not supported]
> [2019-04-20 05:01:29.476313] W [posix.c:4929:posix_getxattr] 0-gvAA01-posix: Extended attributes not supported (try remounting brick with 'user_xattr' flag)
>
> Our setup consists of 2 storage nodes and an arbiter node. I have noticed our nodes are on slightly different versions; I'm not sure if this could be an issue. We have 9 bricks on each node, made up of ZFS RAIDZ2 pools - total capacity is around 560TB.
> We have bonded 10gbps NICs on each node, and I have tested bandwidth with iperf and found that it's what would be expected from this config.
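> (The iperf test was just a plain point-to-point run between the two storage nodes, along these lines - the parallel stream count is only an example:)
>
> # iperf -s              (on one node)
> # iperf -c 02-B -P 4    (from the other node)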
> Individual brick performance seems OK; I've tested several bricks using dd and can write a 10GB file at 1.7GB/s.
>
> # dd if=/dev/zero of=/brick1/test/test.file bs=1M count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB, 9.8 GiB) copied, 6.20303 s, 1.7 GB/s
>
> Node 1:
> # glusterfs --version
> glusterfs 3.12.15
>
> Node 2:
> # glusterfs --version
> glusterfs 3.12.14
>
> Arbiter:
> # glusterfs --version
> glusterfs 3.12.14
>
> Here is our gluster volume status:
>
> # gluster volume status
> Status of volume: gvAA01
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 01-B:/brick1/gvAA01/brick             49152     0          Y       7219
> Brick 02-B:/brick1/gvAA01/brick             49152     0          Y       21845
> Brick 00-A:/arbiterAA01/gvAA01/brick1       49152     0          Y       6931
> Brick 01-B:/brick2/gvAA01/brick             49153     0          Y       7239
> Brick 02-B:/brick2/gvAA01/brick             49153     0          Y       9916
> Brick 00-A:/arbiterAA01/gvAA01/brick2       49153     0          Y       6939
> Brick 01-B:/brick3/gvAA01/brick             49154     0          Y       7235
> Brick 02-B:/brick3/gvAA01/brick             49154     0          Y       21858
> Brick 00-A:/arbiterAA01/gvAA01/brick3       49154     0          Y       6947
> Brick 01-B:/brick4/gvAA01/brick             49155     0          Y       31840
> Brick 02-B:/brick4/gvAA01/brick             49155     0          Y       9933
> Brick 00-A:/arbiterAA01/gvAA01/brick4       49155     0          Y       6956
> Brick 01-B:/brick5/gvAA01/brick             49156     0          Y       7233
> Brick 02-B:/brick5/gvAA01/brick             49156     0          Y       9942
> Brick 00-A:/arbiterAA01/gvAA01/brick5       49156     0          Y       6964
> Brick 01-B:/brick6/gvAA01/brick             49157     0          Y       7234
> Brick 02-B:/brick6/gvAA01/brick             49157     0          Y       9952
> Brick 00-A:/arbiterAA01/gvAA01/brick6       49157     0          Y       6974
> Brick 01-B:/brick7/gvAA01/brick             49158     0          Y       7248
> Brick 02-B:/brick7/gvAA01/brick             49158     0          Y       9960
> Brick 00-A:/arbiterAA01/gvAA01/brick7       49158     0          Y       6984
> Brick 01-B:/brick8/gvAA01/brick             49159     0          Y       7253
> Brick 02-B:/brick8/gvAA01/brick             49159     0          Y       9970
> Brick 00-A:/arbiterAA01/gvAA01/brick8       49159     0          Y       6993
> Brick 01-B:/brick9/gvAA01/brick             49160     0          Y       7245
> Brick 02-B:/brick9/gvAA01/brick             49160     0          Y       9984
> Brick 00-A:/arbiterAA01/gvAA01/brick9       49160     0          Y       7001
> NFS Server on localhost                     2049      0          Y       17276
> Self-heal Daemon on localhost               N/A       N/A        Y       25245
> NFS Server on 02-B                          2049      0          Y       9089
> Self-heal Daemon on 02-B                    N/A       N/A        Y       17838
> NFS Server on 00-a                          2049      0          Y       15660
> Self-heal Daemon on 00-a                    N/A       N/A        Y       16218
>
> Task Status of Volume gvAA01
>
------------------------------------------------------------------------------
> There are no active volume tasks
>
> And gluster volume info:
>
> # gluster volume info
>
> Volume Name: gvAA01
> Type: Distributed-Replicate
> Volume ID: ca4ece2c-13fe-414b-856c-2878196d6118
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 9 x (2 + 1) = 27
> Transport-type: tcp
> Bricks:
> Brick1: 01-B:/brick1/gvAA01/brick
> Brick2: 02-B:/brick1/gvAA01/brick
> Brick3: 00-A:/arbiterAA01/gvAA01/brick1 (arbiter)
> Brick4: 01-B:/brick2/gvAA01/brick
> Brick5: 02-B:/brick2/gvAA01/brick
> Brick6: 00-A:/arbiterAA01/gvAA01/brick2 (arbiter)
> Brick7: 01-B:/brick3/gvAA01/brick
> Brick8: 02-B:/brick3/gvAA01/brick
> Brick9: 00-A:/arbiterAA01/gvAA01/brick3 (arbiter)
> Brick10: 01-B:/brick4/gvAA01/brick
> Brick11: 02-B:/brick4/gvAA01/brick
> Brick12: 00-A:/arbiterAA01/gvAA01/brick4 (arbiter)
> Brick13: 01-B:/brick5/gvAA01/brick
> Brick14: 02-B:/brick5/gvAA01/brick
> Brick15: 00-A:/arbiterAA01/gvAA01/brick5 (arbiter)
> Brick16: 01-B:/brick6/gvAA01/brick
> Brick17: 02-B:/brick6/gvAA01/brick
> Brick18: 00-A:/arbiterAA01/gvAA01/brick6 (arbiter)
> Brick19: 01-B:/brick7/gvAA01/brick
> Brick20: 02-B:/brick7/gvAA01/brick
> Brick21: 00-A:/arbiterAA01/gvAA01/brick7 (arbiter)
> Brick22: 01-B:/brick8/gvAA01/brick
> Brick23: 02-B:/brick8/gvAA01/brick
> Brick24: 00-A:/arbiterAA01/gvAA01/brick8 (arbiter)
> Brick25: 01-B:/brick9/gvAA01/brick
> Brick26: 02-B:/brick9/gvAA01/brick
> Brick27: 00-A:/arbiterAA01/gvAA01/brick9 (arbiter)
> Options Reconfigured:
> cluster.shd-max-threads: 4
> performance.least-prio-threads: 16
> cluster.readdir-optimize: on
> performance.quick-read: off
> performance.stat-prefetch: off
> cluster.data-self-heal: on
> cluster.lookup-unhashed: auto
> cluster.lookup-optimize: on
> cluster.favorite-child-policy: mtime
> server.allow-insecure: on
> transport.address-family: inet
> client.bind-insecure: on
> cluster.entry-self-heal: off
> cluster.metadata-self-heal: off
> performance.md-cache-timeout: 600
> cluster.self-heal-daemon: enable
> performance.readdir-ahead: on
> diagnostics.brick-log-level: INFO
> nfs.disable: off
>
> Thank you for any assistance.
>
> - Patrick
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users