thr3ads.net - Gluster users - [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load)

Strahil Nikolov

2020-Mar-29 21:39 UTC

[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

On March 29, 2020 7:10:49 AM GMT+03:00, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>Hello all,
>
>I am getting split-brain errors in the gnfs nfs.log when 1 gluster
>server is down in a 3-brick/3-node gluster volume. It only happens
>under intense load.
>
>I reported this a few months ago but didn't have a repeatable test case.
>Since then, we got reports from the field and I was able to make a test
>case with 3 gluster servers and 76 NFS clients/compute nodes. I point all
>76 nodes to one gnfs server to make the problem more likely to happen with
>the limited nodes we have in-house.
>
>We are using gluster nfs (ganesha is not yet reliable for our workload)
>to export an NFS filesystem that is used for a read-only root filesystem
>for NFS clients. The largest client count we have is 2592 across 9
>leaders (3 replicated subvolumes) - out in the field. This is where
>the problem was first reported.
>
>In the lab, I have a test case that can repeat the problem on a single
>subvolume cluster.
>
>Please forgive how ugly the test case is. I'm sure an IO test person can
>make it pretty. It basically runs a bunch of cluster-manager
>NFS-intensive operations while also producing other load. If one leader
>is down, nfs.log reports some split-brain errors. For real-world
>customers, the symptom is "some nodes failing to boot" in various ways or
>"jobs failing to launch due to permissions or file read problems (like a
>library not being readable on one node)". If all leaders are up, we see
>no errors.
>
>As an attachment, I will include volume settings.
>
>Here are example nfs.log errors:
>
>
>[2020-03-29 03:42:52.295532] E [MSGID: 108008]
>[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0:
>Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1:
>split-brain observed. [Input/output error]
>[2020-03-29 03:42:52.295583] W [MSGID: 112199]
>[nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3:
><gfid:9e721602-2732-4490-bde3-19cac6e33291>/bin/whoami => (XID:
>19fb1558, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
>[2020-03-29 03:43:03.600023] E [MSGID: 108008]
>[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0:
>Failing ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868:
>split-brain observed. [Input/output error]
>[2020-03-29 03:43:03.600075] W [MSGID: 112199]
>[nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3:
><gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/perl5/vendor_perl/XML/LibXML/Literal.pm
>=> (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output
>error))
>[2020-03-29 03:43:07.681294] E [MSGID: 108008]
>[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0:
>Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b:
>split-brain observed. [Input/output error]
>[2020-03-29 03:43:07.681339] W [MSGID: 112199]
>[nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3:
><gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/.libhogweed.so.4.hmac
>=> (XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output
>error)) target: (null)
>
>
>The brick log isn't very interesting during the failure. There are some
>ACL errors that don't seem to directly relate to the issue at hand.
>(I can attach if requested!)
>
>This is glusterfs72 (although we originally hit it with 4.1.6).
>I'm using rhel8 (although field reports are from rhel76).
>
>If there is anything the community can suggest to help me with this, it
>would really be appreciated. I'm getting unhappy reports from the field
>that the failover doesn't work as expected.
>
>I've tried tweaking several things, from various threading settings to
>enabling md-cache-statfs to mem-factor to listen backlogs. I even tried
>adjusting the cluster.read-hash-mode and choose-local settings.
>
>"cluster-configuration" in the script initiates a bunch of operations on
>the node that results in reading many files and doing some database
>queries. I used it in my test case as it is a common failure point when
>nodes are booting. This test case, although ugly, fails 100% if one
>server is down and works 100% if all servers are up.
>
>
>#! /bin/bash
>
>#
># Test case:
>#
># in a 1x3 Gluster Replicated setup with the HPCM volume settings..
>#
># On a cluster with 76 nodes (maybe can be replicated with less we don't
># know)
>#
># When all the nodes are assigned to one IP alias to get the load in to
># one leader node....
>#
># This test case will produce split-brain errors in the nfs.log file
># when 1 leader is down, but will run clean when all 3 are up.
>#
># It is not necessary to power off the leader you wish to disable. Simply
># running 'systemctl stop glusterd' is sufficient.
>#
># We will use this script to try to resolve the issue with split-brain
># under stress when one leader is down.
>#
>
># (compute group is 76 compute nodes)
>echo "killing any node find or node tar commands..."
>pdsh -f 500 -g compute killall find
>pdsh -f 500 -g compute killall tar
>
># (in this test, leader1 is known to have glusterd stopped for the test case)
>echo "stop, start glusterd, drop caches, sleep 15"
>set -x
>pdsh -w leader2,leader3 systemctl stop glusterd
>sleep 3
>pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
>pdsh -w leader2,leader3 systemctl start glusterd
>set +x
>sleep 15
>
>echo "drop caches on nodes"
>pdsh -f 500 -g compute "echo 3 > /proc/sys/vm/drop_caches"
>
>echo "----------------------------------------------------------------------"
>echo "test start"
>echo "----------------------------------------------------------------------"
>
>set -x
>
>
>pdsh -f 500 -g compute "tar cf - /usr > /dev/null" &
>pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
>pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
>pdsh -f 500 -g compute "find /usr > /dev/null" &
>pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
>pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
>wait
Hey Erik,

That's odd.
As far as I know, the clients are accessing one of the gluster nodes
that serves as NFS server and then syncs data across the peers, right?
What happens when the virtual IP(s) are failed over to the other gluster node?
Is the issue resolved?

Do you get any split brain entries via 'gluster volume heal <VOL> info'?

Also, what kind of load balancing are you using?

Best Regards,
Strahil Nikolov

Erik Jacobson

2020-Mar-30 01:01 UTC

Thank you for replying!! Responses below...

I have attached the volume def (meant to before).
I have attached a couple logs from one of the leaders.
> That's odd.
> As far as I know, the clients are accessing one of the gluster nodes
> that serves as NFS server and then syncs data across the peers, right?

Correct, although in this case, with a 1x3, all of them should have
local copies. Our first reports came in from 3x3 (9 server) systems but
we have been able to duplicate on 1x3 thankfully in house. This is a
huge step forward as I had no reproducer previously.
> What happens when the virtual IP(s) are failed over to the other gluster
> node? Is the issue resolved?

While we do use CTDB for managing the IP aliases, I don't start the test
until the IP is stabilized. I put all 76 nodes on one IP alias to make a
load more similar to what we have in the field.

I think it is important to point out that if I reduce the load, all is
well. For example, if the test were just booting -- where the initial
reports were seen -- just 1 or 2 nodes out of 1,000 would have an issue
each cycle. They all boot the same way and are all using the same IP
alias for NFS in my test case. So I think the split-brain messages are maybe
a symptom of some sort of timeout ??? (making stuff up here).
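
To check whether the split-brain messages cluster in time with the load, the nfs.log entries can be tallied by operation and by minute. This is a minimal sketch, assuming each log entry sits on a single line as in the raw nfs.log (the entries quoted earlier in this thread were wrapped by the mailer). The function names are made up for illustration and are not gluster tools.

```shell
#!/bin/sh
# Hypothetical helpers; both read nfs.log text on stdin.

# Count AFR split-brain failures (MSGID 108008) per file operation,
# e.g. ACCESS or READLINK. The operation is the word after "Failing".
split_brain_by_fop() {
    awk '/MSGID: 108008/ && /split-brain observed/ {
             for (i = 1; i < NF; i++)
                 if ($i == "Failing") { count[$(i + 1)]++; break }
         }
         END { for (fop in count) print fop, count[fop] }' | sort
}

# Count split-brain failures per minute, using the leading
# "[YYYY-MM-DD HH:MM:SS.usec]" timestamp of each entry.
split_brain_per_minute() {
    awk '/split-brain observed/ {
             key = substr($1, 2) " " substr($2, 1, 5)
             n[key]++
         }
         END { for (k in n) print k, n[k] }' | sort
}
```

Usage would be along the lines of `split_brain_by_fop < nfs.log` and `split_brain_per_minute < nfs.log`, with the path to nfs.log depending on the install. If the per-minute counts track the load phases of the test script, that would be consistent with (though not proof of) the timeout guess.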
> Also, what kind of load balancing are you using ?

[I moved this question up because the below answer has too much output]

We are doing very simple balancing - manual balancing. As we add compute
nodes to the cluster, a couple racks are assigned to IP alias #1, the
next couple to IP alias #2, and so on. I'm happy to not have the
complexity of a real load balancer right now.

> Do you get any split brain entries via 'gluster volume heal <VOL> info' ?

I ran three trials for the 'gluster volume heal ...'

Trial 1 - with all 3 servers up and while running the load:
[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 0


Trial 2 - with 1 server down (stopped glusterd on 1 server) - and
without doing any testing yet -- I see this. Let me explain though:
not in the error path, I am using RW NFS filesystem image blobs on this
same volume for the writable areas of the node. In the field, we
duplicate the problem using TMPFS for that writable area. I am
happy to re-do the test with RO NFS and TMPFS for writable, in which case
my GUESS is the healing messages would go away. Would that help?
If you look at the heal count in Trial 3 -- 76 -- that equals the node
count: the number of writable XFS image files used for writing, one for
each node.

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
<gfid:b9412b45-d380-4789-a335-af5af33bde24>
<gfid:80ea53ba-a960-402b-9c6c-1cc62b2c59b3>
<gfid:1f10c050-7c50-4044-abc5-0a980ac6af79>
<gfid:8847f8a4-5509-463d-ac49-836bf921858c>
<gfid:a35fef6b-9174-495f-a661-d9837a1243ac>
<gfid:782dd55f-d85d-4f5e-b76f-8dd562356a59>
<gfid:5ea92161-c91a-4d51-877c-a3362966e850>
<gfid:57e5c49d-36c9-4a70-afd5-34ffbddb7da5>
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
<gfid:b9412b45-d380-4789-a335-af5af33bde24>
<gfid:80ea53ba-a960-402b-9c6c-1cc62b2c59b3>
<gfid:1f10c050-7c50-4044-abc5-0a980ac6af79>
<gfid:8847f8a4-5509-463d-ac49-836bf921858c>
<gfid:a35fef6b-9174-495f-a661-d9837a1243ac>
<gfid:782dd55f-d85d-4f5e-b76f-8dd562356a59>
<gfid:5ea92161-c91a-4d51-877c-a3362966e850>
<gfid:57e5c49d-36c9-4a70-afd5-34ffbddb7da5>
Status: Connected
Number of entries: 8



Trial 3 - ran the heal command around the time the split-brain errors
were being reported


[root@leader2 glusterfs]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
<gfid:80ea53ba-a960-402b-9c6c-1cc62b2c59b3>
<gfid:b9412b45-d380-4789-a335-af5af33bde24>
<gfid:08aff8a9-2818-44d6-a67d-d08c7894c496>
<gfid:8847f8a4-5509-463d-ac49-836bf921858c>
<gfid:57e5c49d-36c9-4a70-afd5-34ffbddb7da5>
<gfid:cd896244-f7e9-41ad-8510-d1fe5d0bf836>
<gfid:611fa1e0-dc0d-4ddc-9273-6035e51e1acf>
<gfid:686581b2-7515-4d0a-a1c8-369f01f60ecd>
<gfid:875e893b-f2ed-4805-95fd-6955ea310757>
<gfid:eb4203eb-06a4-4577-bddb-ba400d5cc7c7>
<gfid:4dd86ddd-aca3-403f-87eb-03a9c8116993>
<gfid:70c90d83-9fb7-4e8e-ac1b-592c4d2b1df8>
<gfid:de9de454-a8f4-4c3f-b8b8-b28b0c444e31>
<gfid:c44b7d98-f83b-4498-aa43-168ce4e35d52>
<gfid:61fde2e7-1898-4e5b-8b7f-f9702b595d3a>
<gfid:e44fd656-62a6-4c06-bafc-66de0ec99022>
<gfid:04aa47b5-52fa-47d0-9b5f-a39bc95eb1fe>
<gfid:6357f8f6-aa5b-40b8-a0f4-6c3366ff4fc2>
<gfid:19728e57-2cc9-4c3a-bb45-e72bc59f3e60>
<gfid:6e1fd334-43a7-4410-b3ef-6566d41d8574>
<gfid:d3b423da-484f-44a6-91d9-365e313bb2ef>
<gfid:da5215c1-565d-4419-beec-db50791de4c4>
<gfid:ff8348dc-8acc-40d5-a0ed-f9b3b5ba61ae>
<gfid:54523a5e-ccd7-4464-806e-3897f297b749>
<gfid:7bf00945-7b9a-46bb-8c73-bc233c644ca5>
<gfid:67ac7750-0b3c-4f88-aa8f-222183d39690>
<gfid:4f4da7fa-819d-45a4-bdb9-a81374b6df86>
<gfid:1b69ff6c-1dcc-4a9b-8c54-d4146cdfdd6c>
<gfid:e3bfb26e-7987-45cb-8824-99b353846c12>
<gfid:18b777df-72dc-4a57-a8a3-f22b54ceac3e>
<gfid:b6994926-5788-492b-8224-3a02100be9a2>
<gfid:434bb8e9-75a7-4670-960f-fefa6893da68>
<gfid:d4c4bc62-705b-405d-a4b4-941f8e55e5d2>
<gfid:f9d3580b-7b24-4061-819d-d62978fd35d0>
<gfid:14f73281-21c9-4830-9a39-1eb6998eb955>
<gfid:26d87a63-8318-4b6c-9093-4817cacc76ef>
<gfid:ff38c782-b28c-46cc-8a6b-e93a2c4d504f>
<gfid:7dbd1e30-c4e0-4b19-8f0d-d4ef9199b89f>
<gfid:1f10c050-7c50-4044-abc5-0a980ac6af79>
<gfid:23edaf65-7f90-47a7-bc1b-cccaf6648543>
<gfid:cf46ac85-50a8-4660-8a2f-564e4825f93e>
<gfid:27dbd511-cf7d-4fa8-bd98-dd2006a0a06b>
<gfid:76661586-1cc6-421d-b0ad-081c105b6532>
<gfid:4db3cac8-1fdb-4b52-9647-dd3979907773>
<gfid:3ffbc798-7733-4ef6-a253-3dc5259c20aa>
<gfid:ea2af645-29ef-4911-9dfd-0409ae1df5a5>
<gfid:a35fef6b-9174-495f-a661-d9837a1243ac>
<gfid:5ea92161-c91a-4d51-877c-a3362966e850>
<gfid:782dd55f-d85d-4f5e-b76f-8dd562356a59>
<gfid:036e02e5-1062-4b48-a090-6e363454aac5>
<gfid:8312c8bd-18c7-449e-8482-16320f3ee8e9>
<gfid:15cfabab-e8df-4cad-b883-b80861ee5775>
<gfid:f804bfed-b17a-4abf-a104-26b01569609b>
<gfid:77253670-1791-4910-9f0d-38c2b1ec0f17>
<gfid:ca502545-5ca2-4db6-baf4-b2eb0e4176f6>
<gfid:964ca255-b2e2-45e4-bb86-51d3e8a4c3f4>
<gfid:7bcfaddd-a65c-41f5-919b-8fb8b501f735>
<gfid:f884e860-6d3e-4597-9249-da0fc17c456f>
<gfid:5960eb89-3ca1-4d9e-8c13-16f0ee4822e3>
<gfid:361517df-19a8-4e43-b601-7822f7e36ef8>
<gfid:09e8541b-a150-41da-aff8-3764c33635ba>
<gfid:b30f6bdb-e439-44c5-bd4c-143486439091>
<gfid:ae983848-3ba9-4f72-ab0c-d309f96d2678>
<gfid:9cddb5cd-a721-4d63-9522-7546a9c01303>
<gfid:91c1b906-14a5-4fe1-8103-91d14134706b>
<gfid:55cc28b7-80f1-428e-9334-5a0742cce1c6>
<gfid:8183219a-6d4a-4369-82dc-2233e2eba656>
<gfid:6cfacb2b-e247-488b-adde-e00dfc0c25f8>
<gfid:5184933d-6470-47dc-a010-6f7cb5661160>
<gfid:66e6842c-fe87-4797-8a01-a9b0a4124cde>
<gfid:55884d32-2e3f-42ba-a173-6c9362e331e2>
<gfid:47b9316a-7896-4efd-8704-3acdde6f2cb8>
<gfid:3a3013e2-06dd-41c7-b759-bd9e945c9743>
<gfid:3dc0b834-6f3e-409a-9549-f015f3b66af1>
<gfid:c9fc97ce-f8bc-42b2-b428-37953c172a30>
<gfid:a94ac3ba-9777-4844-892b-5526c00f2f7b>
Status: Connected
Number of entries: 76

Brick 172.23.0.6:/data/brick_cm_shared
<gfid:9cddb5cd-a721-4d63-9522-7546a9c01303>
<gfid:4f4da7fa-819d-45a4-bdb9-a81374b6df86>
<gfid:91c1b906-14a5-4fe1-8103-91d14134706b>
<gfid:55cc28b7-80f1-428e-9334-5a0742cce1c6>
<gfid:18b777df-72dc-4a57-a8a3-f22b54ceac3e>
<gfid:b6994926-5788-492b-8224-3a02100be9a2>
<gfid:6cfacb2b-e247-488b-adde-e00dfc0c25f8>
<gfid:5184933d-6470-47dc-a010-6f7cb5661160>
<gfid:d4c4bc62-705b-405d-a4b4-941f8e55e5d2>
<gfid:f9d3580b-7b24-4061-819d-d62978fd35d0>
<gfid:14f73281-21c9-4830-9a39-1eb6998eb955>
<gfid:26d87a63-8318-4b6c-9093-4817cacc76ef>
<gfid:ff38c782-b28c-46cc-8a6b-e93a2c4d504f>
<gfid:7dbd1e30-c4e0-4b19-8f0d-d4ef9199b89f>
<gfid:1f10c050-7c50-4044-abc5-0a980ac6af79>
<gfid:23edaf65-7f90-47a7-bc1b-cccaf6648543>
<gfid:cf46ac85-50a8-4660-8a2f-564e4825f93e>
<gfid:27dbd511-cf7d-4fa8-bd98-dd2006a0a06b>
<gfid:76661586-1cc6-421d-b0ad-081c105b6532>
<gfid:4db3cac8-1fdb-4b52-9647-dd3979907773>
<gfid:3ffbc798-7733-4ef6-a253-3dc5259c20aa>
<gfid:ea2af645-29ef-4911-9dfd-0409ae1df5a5>
<gfid:80ea53ba-a960-402b-9c6c-1cc62b2c59b3>
<gfid:b9412b45-d380-4789-a335-af5af33bde24>
<gfid:a35fef6b-9174-495f-a661-d9837a1243ac>
<gfid:08aff8a9-2818-44d6-a67d-d08c7894c496>
<gfid:5ea92161-c91a-4d51-877c-a3362966e850>
<gfid:8847f8a4-5509-463d-ac49-836bf921858c>
<gfid:782dd55f-d85d-4f5e-b76f-8dd562356a59>
<gfid:57e5c49d-36c9-4a70-afd5-34ffbddb7da5>
<gfid:cd896244-f7e9-41ad-8510-d1fe5d0bf836>
<gfid:036e02e5-1062-4b48-a090-6e363454aac5>
<gfid:611fa1e0-dc0d-4ddc-9273-6035e51e1acf>
<gfid:8312c8bd-18c7-449e-8482-16320f3ee8e9>
<gfid:686581b2-7515-4d0a-a1c8-369f01f60ecd>
<gfid:15cfabab-e8df-4cad-b883-b80861ee5775>
<gfid:875e893b-f2ed-4805-95fd-6955ea310757>
<gfid:eb4203eb-06a4-4577-bddb-ba400d5cc7c7>
<gfid:f804bfed-b17a-4abf-a104-26b01569609b>
<gfid:4dd86ddd-aca3-403f-87eb-03a9c8116993>
<gfid:77253670-1791-4910-9f0d-38c2b1ec0f17>
<gfid:70c90d83-9fb7-4e8e-ac1b-592c4d2b1df8>
<gfid:de9de454-a8f4-4c3f-b8b8-b28b0c444e31>
<gfid:ca502545-5ca2-4db6-baf4-b2eb0e4176f6>
<gfid:c44b7d98-f83b-4498-aa43-168ce4e35d52>
<gfid:964ca255-b2e2-45e4-bb86-51d3e8a4c3f4>
<gfid:61fde2e7-1898-4e5b-8b7f-f9702b595d3a>
<gfid:7bcfaddd-a65c-41f5-919b-8fb8b501f735>
<gfid:e44fd656-62a6-4c06-bafc-66de0ec99022>
<gfid:04aa47b5-52fa-47d0-9b5f-a39bc95eb1fe>
<gfid:f884e860-6d3e-4597-9249-da0fc17c456f>
<gfid:6357f8f6-aa5b-40b8-a0f4-6c3366ff4fc2>
<gfid:19728e57-2cc9-4c3a-bb45-e72bc59f3e60>
<gfid:5960eb89-3ca1-4d9e-8c13-16f0ee4822e3>
<gfid:6e1fd334-43a7-4410-b3ef-6566d41d8574>
<gfid:361517df-19a8-4e43-b601-7822f7e36ef8>
<gfid:d3b423da-484f-44a6-91d9-365e313bb2ef>
<gfid:09e8541b-a150-41da-aff8-3764c33635ba>
<gfid:da5215c1-565d-4419-beec-db50791de4c4>
<gfid:ff8348dc-8acc-40d5-a0ed-f9b3b5ba61ae>
<gfid:b30f6bdb-e439-44c5-bd4c-143486439091>
<gfid:54523a5e-ccd7-4464-806e-3897f297b749>
<gfid:ae983848-3ba9-4f72-ab0c-d309f96d2678>
<gfid:7bf00945-7b9a-46bb-8c73-bc233c644ca5>
<gfid:67ac7750-0b3c-4f88-aa8f-222183d39690>
<gfid:1b69ff6c-1dcc-4a9b-8c54-d4146cdfdd6c>
<gfid:e3bfb26e-7987-45cb-8824-99b353846c12>
<gfid:8183219a-6d4a-4369-82dc-2233e2eba656>
<gfid:434bb8e9-75a7-4670-960f-fefa6893da68>
<gfid:66e6842c-fe87-4797-8a01-a9b0a4124cde>
<gfid:55884d32-2e3f-42ba-a173-6c9362e331e2>
<gfid:47b9316a-7896-4efd-8704-3acdde6f2cb8>
<gfid:3a3013e2-06dd-41c7-b759-bd9e945c9743>
<gfid:3dc0b834-6f3e-409a-9549-f015f3b66af1>
<gfid:c9fc97ce-f8bc-42b2-b428-37953c172a30>
<gfid:a94ac3ba-9777-4844-892b-5526c00f2f7b>
Status: Connected
Number of entries: 76
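
With output this long it is easier to compare the bricks mechanically than by eye. This is a minimal sketch, assuming the `gluster volume heal cm_shared info` output has been saved to a file in the same layout shown above. The helper names heal_summary and brick_gfids are hypothetical, not gluster commands.

```shell
#!/bin/sh
# Hypothetical helpers; both read saved "heal info" output on stdin.

# Print one line per brick: the brick name and its "Number of entries"
# value (a "-" is printed as-is for a brick that is not connected).
heal_summary() {
    awk '
        /^Brick /              { brick = $2 }
        /^Number of entries: / { print brick, $4 }
    '
}

# Print the sorted gfid entries listed under one brick.
# $1 is the brick name exactly as it appears after "Brick ".
brick_gfids() {
    awk -v want="$1" '
        /^Brick /  { cur = $2 }
        /^<gfid:/  { if (cur == want) print }
    ' | sort
}
```

For example, after `gluster volume heal cm_shared info > heal.txt`, running `heal_summary < heal.txt` gives the per-brick counts, and the output of `brick_gfids 172.23.0.5:/data/brick_cm_shared < heal.txt` can be compared with the same call for 172.23.0.6 to see whether both surviving bricks list the same pending entries.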

-------------- next part --------------
 
Volume Name: cm_shared
Type: Replicate
Volume ID: f6175f56-8422-4056-9891-f9ba84756b87
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.0.4:/data/brick_cm_shared
Brick2: 172.23.0.5:/data/brick_cm_shared
Brick3: 172.23.0.6:/data/brick_cm_shared
Options Reconfigured:
nfs.event-threads: 3
config.brick-threads: 16
config.client-threads: 16
performance.iot-pass-through: false
config.global-threading: off
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
cluster.lookup-optimize: on
client.event-threads: 32
server.event-threads: 32
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 1000000
performance.io-thread-count: 32
performance.cache-size: 8GB
performance.parallel-readdir: on
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.aggregate-size: 2048KB
performance.write-behind-trickling-writes: off
transport.listen-backlog: 16384
performance.write-behind-window-size: 1024MB
server.outstanding-rpc-limit: 1024
nfs.outstanding-rpc-limit: 1024
nfs.acl: on
storage.max-hardlinks: 0
performance.cache-refresh-timeout: 60
performance.md-cache-statfs: off
performance.nfs.io-cache: on
nfs.mount-rmtab: /-
nfs.nlm: off
nfs.export-volumes: on
nfs.export-dirs: on
nfs.exports-auth-enable: on
nfs.auth-refresh-interval-sec: 360
nfs.auth-cache-ttl-sec: 360
cluster.favorite-child-policy: none
nfs.mem-factor: 15
cluster.choose-local: true
network.ping-timeout: 42
cluster.read-hash-mode: 1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfs.log.xz
Type: application/x-xz
Size: 4992 bytes
Desc: not available
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20200329/8d6d1423/attachment.xz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glustershd.log.xz
Type: application/x-xz
Size: 4080 bytes
Desc: not available
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20200329/8d6d1423/attachment-0001.xz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glusterd.log.xz
Type: application/x-xz
Size: 4244 bytes
Desc: not available
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20200329/8d6d1423/attachment-0002.xz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: data-brick_cm_shared.log.xz
Type: application/x-xz
Size: 11376 bytes
Desc: not available
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20200329/8d6d1423/attachment-0003.xz>
-------------- next part --------------
Status of volume: cm_shared
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.23.0.5:/data/brick_cm_shared      49153     0          Y       50199
Brick 172.23.0.6:/data/brick_cm_shared      49153     0          Y       59380
Self-heal Daemon on localhost               N/A       N/A        Y       10817
NFS Server on localhost                     2049      0          Y       10775
Self-heal Daemon on 172.23.0.5              N/A       N/A        Y       16645
NFS Server on 172.23.0.5                    2049      0          Y       16603
 
Task Status of Volume cm_shared
------------------------------------------------------------------------------
There are no active volume tasks
