Patrick,
Sounds like progress. Be aware that gluster is expected to max out the CPUs on
at least one of your servers while healing. This is normal and won't adversely
affect overall performance (any more than having bricks in need of healing, at
any rate) unless you're overdoing it. shd threads <= 4 should not be overdoing
it on your hardware. Other tunings may have also increased overall performance,
so you may see higher CPU than previously anyway. I'd recommend upping those
thread counts and letting it heal as fast as possible, especially if these are
dedicated Gluster storage servers (i.e. not also running VMs, etc.). You should
see "normal" CPU use once heals are completed. I see ~15-30% overall normally,
and 95-98% while healing (across my 20 cores). It's also likely to differ
between your servers: in a pure replica, one tends to max out and the other
tends to run a little higher than normal; in a distributed-replica, I'd expect
more than one to run harder while healing.
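If it's useful, a quick way to keep an eye on heal progress and the current
thread setting is something like this (a sketch, with gvAA01 being your volume
name from the output further down):

# gluster volume heal gvAA01 statistics heal-count
# gluster volume get gvAA01 cluster.shd-max-threads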
Keep the difference between doing an ls on a brick and doing an ls on a gluster
mount in mind. When you do an ls on a gluster volume, it isn't just doing an ls
on one brick; it's effectively doing it on ALL of your bricks, and they all have
to return data before the ls succeeds. In a distributed volume, it's also
figuring out where things live on each subvolume and getting the stat() from
each to assemble the whole listing. And if things are in need of healing, it
will take even longer to decide which version is current and use it (shd
triggers a heal any time it encounters this). Any one of these things being slow
slows down the overall response.
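If you want to put numbers on that, comparing the same directory directly on a
brick and through the mount makes it obvious. The paths below are only
illustrative, adjust to your actual brick and mount points:

# time ls /brick1/gvAA01/brick/some-folder
# time ls /mnt/gvAA01/some-folder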
At this point, I'd get some sleep too, and let your cluster heal while you do.
I'd really want it fully healed before I did any updates anyway, so let it use
CPU and get itself sorted out. Expect it to do a round of healing after you
upgrade each machine too; this is normal, so don't let the CPU spike surprise
you. It's just catching up from the downtime incurred by the update and/or
reboot, if you did one.
That reminds me, check your gluster cluster.op-version and
cluster.max-op-version (gluster vol get all all | grep op-version). If
op-version isn't at the max-op-version, set it to the max so you're taking
advantage of the latest features available to your version.
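For example:

# gluster vol get all cluster.op-version
# gluster vol get all cluster.max-op-version
# gluster vol set all cluster.op-version <whatever max-op-version reports>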
-Darrell
> On Apr 20, 2019, at 11:54 AM, Patrick Rennie <patrickmrennie at
gmail.com> wrote:
>
> Hi Darrell,
>
> Thanks again for your advice. I've applied acltype=posixacl on my zpools,
and I think that has reduced some of the noise from my brick logs.
> I also bumped up some of the thread counts you suggested but my CPU load
skyrocketed, so I dropped it back down to something slightly lower, but still
higher than it was before, and will see how that goes for a while.
>
> Although low space is a definite issue, if I run an ls anywhere on my
bricks directly it's instant (<1 second), but it still takes several minutes
via gluster, so there is still a problem in my gluster configuration somewhere.
We don't have any snapshots, but I am trying to work out if any data on there
is safe to delete, or if there is any way I can safely find and delete data
which has been removed directly from the bricks in the past. I also have lz4
compression already enabled on each zpool, which does help a bit; we get
between 1.05 and 1.08x compression on this data.
> I've tried to go through each client and check its cluster mount logs, as
well as my brick logs, looking for errors. So far nothing is jumping out at me,
but there are some warnings and errors here and there; I am trying to work out
what they mean.
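> In case it helps anyone following along, I've mainly just been grepping the
default log locations for error-level entries, roughly like this (paths assume
a stock install):
> # grep ' E ' /var/log/glusterfs/bricks/*.log | tail -n 50
> # grep ' E ' /var/log/glusterfs/*.log | tail -n 50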
>
> It's already 1 am here and unfortunately, I'm still awake working
on this issue, but I think that I will have to leave the version upgrades until
tomorrow.
>
> Thanks again for your advice so far. If anyone has any ideas on where I can
look for errors other than brick logs or the cluster mount logs to help resolve
this issue, it would be much appreciated.
>
> Cheers,
>
> - Patrick
>
> On Sat, Apr 20, 2019 at 11:57 PM Darrell Budic <budic at onholyground.com> wrote:
> See inline:
>
>> On Apr 20, 2019, at 10:09 AM, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>>
>> Hi Darrell,
>>
>> Thanks for your reply. This issue seems to be getting worse over the last
few days and really has me tearing my hair out. I will do as you have
suggested and get started on upgrading from 3.12.14 to 3.12.15.
>> I've checked the zfs properties and all bricks have "xattr=sa" set, but
none of them has "acltype=posixacl" set; currently the acltype property shows
"off". If I make this change, will it apply retroactively to the existing
data? I'm unfamiliar with what this will change, so I may need to look into
that before I proceed.
>
> It is safe to apply that now; any new set/get calls will then use posixacls
where they exist, and fall back to the older behavior if not. ZFS is good that
way. It should clear up your posix_acl and posix errors over time.
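> For reference, turning it on is just a zfs set per dataset, along these
lines (dataset names here are placeholders for your actual brick datasets):
> # zfs set acltype=posixacl tank/brick1
> # zfs get xattr,acltype tank/brick1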
>
>> I understand performance is going to slow down as the bricks get full. I
am currently trying to free space and migrate data to some newer storage; I
have several hundred TB of fresh storage I set up recently, but with these
performance issues the migration is really slow. I also believe there is
significant data which has been deleted directly from the bricks in the past,
so if I can reclaim this space in a safe manner then I will have at least
around 10-15% free space.
>
> Full ZFS volumes will have a much larger impact on performance than you'd
think, so I'd prioritize this. If you have been taking zfs snapshots, consider
deleting them to get the overall volume free space back up. And just to be sure
it's been said: delete from within the mounted volumes, don't delete directly
from the bricks (gluster will just try and heal it later, compounding your
issues). This doesn't apply to deleting other data from the ZFS volume if it's
not part of the brick directory, of course.
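> A quick way to see how tight each pool really is, and whether snapshots are
holding space (the pool name here is a placeholder):
> # zpool list
> # zfs list -o name,used,avail,refer -r tank
> # zfs list -t snapshot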
>
>> These servers have dual 8 core Xeon (E5-2620v4) and 512GB of RAM so
generally they have plenty of resources available, currently only using around
330/512GB of memory.
>>
>> I will look into what your suggested settings will change, and then will
probably go ahead with your recommendations. For our specs as stated above,
what would you suggest for performance.io-thread-count?
>
> I run single 2630v4s on my servers, which have a smaller storage footprint
than yours. I'd go with 32 for performance.io-thread-count, and I'd try 4 for
the shd thread settings on that gear. Your memory use sounds fine, so no
worries there.
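> As a sketch, those would be applied with something like the following, one
at a time while watching the effect:
> # gluster volume set gvAA01 performance.io-thread-count 32
> # gluster volume set gvAA01 cluster.shd-max-threads 4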
>
>> Our workload is nothing too extreme, we have a few VMs which write
backup data to this storage nightly for our clients, our VMs don't live on
this cluster, but just write to it.
>
> If they are writing compressible data, you'll get immediate benefit by
setting compression=lz4 on your ZFS volumes. It won't help any old data, of
course, but it will compress new data going forward. This is another one that's
safe to enable on the fly.
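> For example (dataset name again a placeholder):
> # zfs set compression=lz4 tank/brick1
> # zfs get compression,compressratio tank/brick1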
>
>> I've been going through all of the logs I can; below are some slightly
sanitized errors I've come across, but I'm not sure what to make of them. The
main error I am seeing is the first one below, across several of my bricks,
but possibly only for specific folders on the cluster; I'm not 100% sure about
that yet though.
>>
>> [2019-04-20 05:56:59.512649] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick7/xxxxxxxxxxxxxxxxxxxx: system.posix_acl_default [Operation not
supported]
>> [2019-04-20 05:59:06.084333] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick7/xxxxxxxxxxxxxxxxxxxx: system.posix_acl_default [Operation not
supported]
>> [2019-04-20 05:59:43.289030] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick7/xxxxxxxxxxxxxxxxxxxx: system.posix_acl_default [Operation not
supported]
>> [2019-04-20 05:59:50.582257] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick7/xxxxxxxxxxxxxxxxxxxx: system.posix_acl_default [Operation not
supported]
>> [2019-04-20 06:01:42.501701] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick7/xxxxxxxxxxxxxxxxxxxx: system.posix_acl_default [Operation not
supported]
>> [2019-04-20 06:01:51.665354] W [posix.c:4929:posix_getxattr]
0-gvAA01-posix: Extended attributes not supported (try remounting brick with
'user_xattr' flag)
>>
>>
>> [2019-04-20 13:12:36.131856] E [MSGID: 113002]
[posix-helpers.c:893:posix_gfid_set] 0-gvAA01-posix: gfid is null for
/xxxxxxxxxxxxxxxxxxxx [Invalid argument]
>> [2019-04-20 13:12:36.131959] E [MSGID: 113002]
[posix.c:362:posix_lookup] 0-gvAA01-posix: buf->ia_gfid is null for
/brick2/xxxxxxxxxxxxxxxxxxxx_62906_tmp [No data available]
>> [2019-04-20 13:12:36.132016] E [MSGID: 115050]
[server-rpc-fops.c:175:server_lookup_cbk] 0-gvAA01-server: 24274759: LOOKUP
/xxxxxxxxxxxxxxxxxxxx (a7c9b4a0-b7ee-4d01-a79e-576013c8ac87/Cloud
Backup_clone1.vbm_62906_tmp), client:
00-A-16217-2019/04/08-21:23:03:692424-gvAA01-client-4-0-3, error-xlator:
gvAA01-posix [No data available]
>> [2019-04-20 13:12:38.093719] E [MSGID: 115050]
[server-rpc-fops.c:175:server_lookup_cbk] 0-gvAA01-server: 24276491: LOOKUP
/xxxxxxxxxxxxxxxxxxxx (a7c9b4a0-b7ee-4d01-a79e-576013c8ac87/Cloud
Backup_clone1.vbm_62906_tmp), client:
00-A-16217-2019/04/08-21:23:03:692424-gvAA01-client-4-0-3, error-xlator:
gvAA01-posix [No data available]
>> [2019-04-20 13:12:38.093660] E [MSGID: 113002]
[posix-helpers.c:893:posix_gfid_set] 0-gvAA01-posix: gfid is null for
/xxxxxxxxxxxxxxxxxxxx [Invalid argument]
>> [2019-04-20 13:12:38.093696] E [MSGID: 113002]
[posix.c:362:posix_lookup] 0-gvAA01-posix: buf->ia_gfid is null for
/brick2/xxxxxxxxxxxxxxxxxxxx [No data available]
>>
>
> posixacls should clear those up, as mentioned.
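> Once acltype is set, you should see those errors taper off; something like
this makes it easy to watch (log path assumes a default install):
> # grep -c system.posix_acl_default /var/log/glusterfs/bricks/*.log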
>
>>
>> [2019-04-20 14:25:59.654576] E [inodelk.c:404:__inode_unlock_lock]
0-gvAA01-locks: Matching lock not found for unlock 0-9223372036854775807, by
980fdbbd367f0000 on 0x7fc4f0161440
>> [2019-04-20 14:25:59.654668] E [MSGID: 115053]
[server-rpc-fops.c:295:server_inodelk_cbk] 0-gvAA01-server: 6092928: INODELK
/xxxxxxxxxxxxxxxxxxxx.cdr$ (25b14631-a179-4274-8243-6e272d4f2ad8), client:
cb-per-worker18-53637-2019/04/19-14:25:37:927673-gvAA01-client-1-0-4,
error-xlator: gvAA01-locks [Invalid argument]
>>
>>
>> [2019-04-20 13:35:07.495495] E [rpcsvc.c:1364:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x247c644, Program: GlusterFS 3.3,
ProgVers: 330, Proc: 27) to rpc-transport (tcp.gvAA01-server)
>> [2019-04-20 13:35:07.495619] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/debug/io-stats.so(+0x1696a)
[0x7ff4ae6f796a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/protocol/server.so(+0x2d6e8)
[0x7ff4ae2a96e8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/protocol/server.so(+0x928d)
[0x7ff4ae28528d] ) 0-: Reply submission failed
>>
>
> Fix the posix acls and see if these clear up over time as well; I'm unclear
on what the overall effect of running without posix acls is on total gluster
health. Your biggest problem sounds like space: you need to free up space on
the volumes and get overall volume health back up to par, then see if that
doesn't resolve the symptoms you're seeing.
>
>
>>
>> Thank you again for your assistance. It is greatly appreciated.
>>
>> - Patrick
>>
>>
>>
>> On Sat, Apr 20, 2019 at 10:50 PM Darrell Budic <budic at onholyground.com> wrote:
>> Patrick,
>>
>> I would definitely upgrade your two nodes from 3.12.14 to 3.12.15. You
also mention ZFS, and that error you show makes me think you need to check to
be sure you have "xattr=sa" and "acltype=posixacl" set on your ZFS volumes.
>>
>> You also observed your bricks are crossing the 95% full line; ZFS
performance will degrade significantly the closer you get to full. In my
experience, this starts somewhere between 10% and 5% free space remaining, so
you're in that realm.
>>
>> How's your free memory on the servers doing? Do you have your zfs arc
cache limited to something less than all the RAM? It shares pretty well, but
I've encountered situations where other things won't try to take RAM back
properly if they think it's in use, so ZFS never gets the opportunity to give
it up.
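>> If you do decide to cap the ARC, the usual ZFS-on-Linux route is the
zfs_arc_max module parameter; the 64 GiB value below is only an illustration,
size it for your own workload:
>> # echo "options zfs zfs_arc_max=68719476736" >> /etc/modprobe.d/zfs.conf
>> # echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max    # applies at runtime on ZFS on Linux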
>>
>> Since your volume is a disperse-replica, you might try tuning
disperse.shd-max-threads; the default is 1, and I'd try it at 2, 4, or even
more if the CPUs are beefy enough. Setting server.event-threads to 4 and
client.event-threads to 8 has also proven helpful in many cases. After you get
upgraded to 3.12.15, enabling performance.stat-prefetch may help as well. I
don't know if it matters, but I'd also recommend resetting
performance.least-prio-threads to the default of 1 (or try 2 or 4) and/or
setting performance.io-thread-count to 32 if those have beefy CPUs.
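>> As a sketch of applying those (noting that gvAA01 turned out to be
distributed-replicate rather than disperse, so the heal-thread option that
applies is cluster.shd-max-threads instead of disperse.shd-max-threads):
>> # gluster volume set gvAA01 cluster.shd-max-threads 2
>> # gluster volume set gvAA01 server.event-threads 4
>> # gluster volume set gvAA01 client.event-threads 8
>> # gluster volume set gvAA01 performance.least-prio-threads 1
>> # gluster volume set gvAA01 performance.io-thread-count 32
>> # gluster volume set gvAA01 performance.stat-prefetch on    # once on 3.12.15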
>>
>> Beyond those general ideas, more info about your hardware (CPU and RAM)
and workload (VMs, direct storage for web servers or renderers, etc.) may net
you some more ideas. Then you're going to have to do more digging into brick
logs, looking for errors and/or warnings, to see what's going on.
>>
>> -Darrell
>>
>>
>>> On Apr 20, 2019, at 8:22 AM, Patrick Rennie <patrickmrennie at gmail.com> wrote:
>>>
>>> Hello Gluster Users,
>>>
>>> I am hoping someone can help me with resolving an ongoing issue I've
been having; I'm new to mailing lists, so forgive me if I have gotten anything
wrong. We have noticed our performance deteriorating over the last few weeks,
easily measured by doing an ls on one of our top-level folders and timing it:
it usually would take 2-5 seconds and now takes up to 20 minutes, which
obviously renders our cluster basically unusable. This has been intermittent
in the past but is now almost constant, and I am not sure how to work out the
exact cause. We have noticed some errors in the brick logs, and have noticed
that if we kill the right brick process, performance instantly returns to
normal. It is not always the same brick, but it indicates to me that something
in the brick processes or background tasks may be causing extreme latency. Due
to this ability to fix it by killing the right brick process off, I think it's
a specific file, folder, or operation which may be hanging and causing the
increased latency, but I am not sure how to work it out. One last thing to add
is that our bricks are getting quite full (~95% full); we are trying to
migrate data off to new storage, but that is going slowly, not helped by this
issue. I am currently trying to run a full heal as there appear to be many
files needing healing, and I have all brick processes running so they have an
opportunity to heal, but this means performance is very poor. It currently
takes over 15-20 minutes to do an ls of one of our top-level folders, which
just contains 60-80 other folders; this should take 2-5 seconds. This is all
being checked via a FUSE mount locally on the storage node itself, but it is
the same for other clients and VMs accessing the cluster. Initially, it seemed
our NFS mounts were not affected and operated at normal speed, but testing
over the last day has shown that our NFS clients are also extremely slow, so
it doesn't seem specific to FUSE as I first thought it might be.
>>>
>>> I am not sure how to proceed from here; I am fairly new to gluster,
having inherited this setup from my predecessor, and am trying to keep it
going. I have included some info below to try and help with diagnosis; please
let me know if any further info would be helpful. I would really appreciate
any advice on what I could try to work out the cause. Thank you in advance for
reading this, and for any suggestions you might be able to offer.
>>>
>>> - Patrick
>>>
>>> This is an example of the main error I see in our brick logs; there
have been others, and I can post them when I see them again too:
>>> [2019-04-20 04:54:43.055680] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick1/<filename> library: system.posix_acl_default [Operation not
supported]
>>> [2019-04-20 05:01:29.476313] W [posix.c:4929:posix_getxattr]
0-gvAA01-posix: Extended attributes not supported (try remounting brick with
'user_xattr' flag)
>>>
>>> Our setup consists of 2 storage nodes and an arbiter node. I have
noticed our nodes are on slightly different versions; I'm not sure if this
could be an issue. We have 9 bricks on each node, made up of ZFS RAIDZ2 pools -
total capacity is around 560TB.
>>> We have bonded 10gbps NICs on each node, and I have tested bandwidth
with iperf and found that it's what would be expected from this config.
>>> Individual brick performance seems OK; I've tested several bricks using
dd and can write a 10GB file at 1.7GB/s.
>>>
>>> # dd if=/dev/zero of=/brick1/test/test.file bs=1M count=10000
>>> 10000+0 records in
>>> 10000+0 records out
>>> 10485760000 bytes (10 GB, 9.8 GiB) copied, 6.20303 s, 1.7 GB/s
>>>
>>> Node 1:
>>> # glusterfs --version
>>> glusterfs 3.12.15
>>>
>>> Node 2:
>>> # glusterfs --version
>>> glusterfs 3.12.14
>>>
>>> Arbiter:
>>> # glusterfs --version
>>> glusterfs 3.12.14
>>>
>>> Here is our gluster volume status:
>>>
>>> # gluster volume status
>>> Status of volume: gvAA01
>>> Gluster process                           TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick 01-B:/brick1/gvAA01/brick            49152     0          Y       7219
>>> Brick 02-B:/brick1/gvAA01/brick            49152     0          Y       21845
>>> Brick 00-A:/arbiterAA01/gvAA01/brick1      49152     0          Y       6931
>>> Brick 01-B:/brick2/gvAA01/brick            49153     0          Y       7239
>>> Brick 02-B:/brick2/gvAA01/brick            49153     0          Y       9916
>>> Brick 00-A:/arbiterAA01/gvAA01/brick2      49153     0          Y       6939
>>> Brick 01-B:/brick3/gvAA01/brick            49154     0          Y       7235
>>> Brick 02-B:/brick3/gvAA01/brick            49154     0          Y       21858
>>> Brick 00-A:/arbiterAA01/gvAA01/brick3      49154     0          Y       6947
>>> Brick 01-B:/brick4/gvAA01/brick            49155     0          Y       31840
>>> Brick 02-B:/brick4/gvAA01/brick            49155     0          Y       9933
>>> Brick 00-A:/arbiterAA01/gvAA01/brick4      49155     0          Y       6956
>>> Brick 01-B:/brick5/gvAA01/brick            49156     0          Y       7233
>>> Brick 02-B:/brick5/gvAA01/brick            49156     0          Y       9942
>>> Brick 00-A:/arbiterAA01/gvAA01/brick5      49156     0          Y       6964
>>> Brick 01-B:/brick6/gvAA01/brick            49157     0          Y       7234
>>> Brick 02-B:/brick6/gvAA01/brick            49157     0          Y       9952
>>> Brick 00-A:/arbiterAA01/gvAA01/brick6      49157     0          Y       6974
>>> Brick 01-B:/brick7/gvAA01/brick            49158     0          Y       7248
>>> Brick 02-B:/brick7/gvAA01/brick            49158     0          Y       9960
>>> Brick 00-A:/arbiterAA01/gvAA01/brick7      49158     0          Y       6984
>>> Brick 01-B:/brick8/gvAA01/brick            49159     0          Y       7253
>>> Brick 02-B:/brick8/gvAA01/brick            49159     0          Y       9970
>>> Brick 00-A:/arbiterAA01/gvAA01/brick8      49159     0          Y       6993
>>> Brick 01-B:/brick9/gvAA01/brick            49160     0          Y       7245
>>> Brick 02-B:/brick9/gvAA01/brick            49160     0          Y       9984
>>> Brick 00-A:/arbiterAA01/gvAA01/brick9      49160     0          Y       7001
>>> NFS Server on localhost                    2049      0          Y       17276
>>> Self-heal Daemon on localhost              N/A       N/A        Y       25245
>>> NFS Server on 02-B                         2049      0          Y       9089
>>> Self-heal Daemon on 02-B                   N/A       N/A        Y       17838
>>> NFS Server on 00-a                         2049      0          Y       15660
>>> Self-heal Daemon on 00-a                   N/A       N/A        Y       16218
>>>
>>> Task Status of Volume gvAA01
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> And gluster volume info:
>>>
>>> # gluster volume info
>>>
>>> Volume Name: gvAA01
>>> Type: Distributed-Replicate
>>> Volume ID: ca4ece2c-13fe-414b-856c-2878196d6118
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 9 x (2 + 1) = 27
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 01-B:/brick1/gvAA01/brick
>>> Brick2: 02-B:/brick1/gvAA01/brick
>>> Brick3: 00-A:/arbiterAA01/gvAA01/brick1 (arbiter)
>>> Brick4: 01-B:/brick2/gvAA01/brick
>>> Brick5: 02-B:/brick2/gvAA01/brick
>>> Brick6: 00-A:/arbiterAA01/gvAA01/brick2 (arbiter)
>>> Brick7: 01-B:/brick3/gvAA01/brick
>>> Brick8: 02-B:/brick3/gvAA01/brick
>>> Brick9: 00-A:/arbiterAA01/gvAA01/brick3 (arbiter)
>>> Brick10: 01-B:/brick4/gvAA01/brick
>>> Brick11: 02-B:/brick4/gvAA01/brick
>>> Brick12: 00-A:/arbiterAA01/gvAA01/brick4 (arbiter)
>>> Brick13: 01-B:/brick5/gvAA01/brick
>>> Brick14: 02-B:/brick5/gvAA01/brick
>>> Brick15: 00-A:/arbiterAA01/gvAA01/brick5 (arbiter)
>>> Brick16: 01-B:/brick6/gvAA01/brick
>>> Brick17: 02-B:/brick6/gvAA01/brick
>>> Brick18: 00-A:/arbiterAA01/gvAA01/brick6 (arbiter)
>>> Brick19: 01-B:/brick7/gvAA01/brick
>>> Brick20: 02-B:/brick7/gvAA01/brick
>>> Brick21: 00-A:/arbiterAA01/gvAA01/brick7 (arbiter)
>>> Brick22: 01-B:/brick8/gvAA01/brick
>>> Brick23: 02-B:/brick8/gvAA01/brick
>>> Brick24: 00-A:/arbiterAA01/gvAA01/brick8 (arbiter)
>>> Brick25: 01-B:/brick9/gvAA01/brick
>>> Brick26: 02-B:/brick9/gvAA01/brick
>>> Brick27: 00-A:/arbiterAA01/gvAA01/brick9 (arbiter)
>>> Options Reconfigured:
>>> cluster.shd-max-threads: 4
>>> performance.least-prio-threads: 16
>>> cluster.readdir-optimize: on
>>> performance.quick-read: off
>>> performance.stat-prefetch: off
>>> cluster.data-self-heal: on
>>> cluster.lookup-unhashed: auto
>>> cluster.lookup-optimize: on
>>> cluster.favorite-child-policy: mtime
>>> server.allow-insecure: on
>>> transport.address-family: inet
>>> client.bind-insecure: on
>>> cluster.entry-self-heal: off
>>> cluster.metadata-self-heal: off
>>> performance.md-cache-timeout: 600
>>> cluster.self-heal-daemon: enable
>>> performance.readdir-ahead: on
>>> diagnostics.brick-log-level: INFO
>>> nfs.disable: off
>>>
>>> Thank you for any assistance.
>>>
>>> - Patrick
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>