Erik Jacobson
2020-Mar-29 04:10 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster server is down in a 3-brick/3-node gluster volume. It only happens under intense load.

I reported this a few months ago but didn't have a repeatable test case. Since then, we got reports from the field and I was able to make a test case with 3 gluster servers and 76 NFS clients/compute nodes. I point all 76 nodes to one gnfs server to make the problem more likely to happen with the limited nodes we have in-house.

We are using gluster NFS (ganesha is not yet reliable for our workload) to export an NFS filesystem that is used as a read-only root filesystem for the NFS clients. The largest client count we have is 2592 across 9 leaders (3 replicated subvolumes), out in the field. This is where the problem was first reported.

In the lab, I have a test case that can repeat the problem on a single-subvolume cluster.

Please forgive how ugly the test case is; I'm sure an IO test person could make it pretty. It basically runs a bunch of cluster-manager NFS-intensive operations while also producing other load. If one leader is down, nfs.log reports some split-brain errors. For real-world customers, the symptom is "some nodes failing to boot" in various ways, or "jobs failing to launch due to permissions or file read problems" (like a library not being readable on one node). If all leaders are up, we see no errors.

As an attachment, I will include the volume settings.

Here are example nfs.log errors:

[2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
[2020-03-29 03:42:52.295583] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/bin/whoami => (XID: 19fb1558, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:03.600023] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868: split-brain observed. [Input/output error]
[2020-03-29 03:43:03.600075] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/perl5/vendor_perl/XML/LibXML/Literal.pm => (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:07.681294] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. [Input/output error]
[2020-03-29 03:43:07.681339] W [MSGID: 112199] [nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/.libhogweed.so.4.hmac => (XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)

The brick log isn't very interesting during the failure. There are some ACL errors that don't seem to directly relate to the issue at hand. (I can attach it if requested!)

This is glusterfs 7.2 (although we originally hit it with 4.1.6). I'm using RHEL 8 (although the field reports are from RHEL 7.6).

If there is anything the community can suggest to help me with this, it would really be appreciated. I'm getting unhappy reports from the field that the failover doesn't work as expected.

I've tried tweaking several things, from various threading settings to enabling md-cache-statfs, mem-factor, and listen backlogs.
I even tried adjusting the cluster.read-hash-mode and choose-local settings.

"cluster-configuration" in the script initiates a bunch of operations on the node that result in reading many files and doing some database queries. I used it in my test case as it is a common failure point when nodes are booting. This test case, although ugly, fails 100% of the time if one server is down and works 100% of the time if all servers are up.

#! /bin/bash
#
# Test case:
#
# In a 1x3 gluster replicated setup with the HPCM volume settings...
#
# On a cluster with 76 nodes (it may be reproducible with fewer; we don't
# know)...
#
# When all the nodes are assigned to one IP alias to direct the load to
# one leader node...
#
# This test case will produce split-brain errors in the nfs.log file
# when 1 leader is down, but will run clean when all 3 are up.
#
# It is not necessary to power off the leader you wish to disable. Simply
# running 'systemctl stop glusterd' is sufficient.
#
# We will use this script to try to resolve the issue with split-brain
# under stress when one leader is down.
#

# (The compute group is 76 compute nodes.)
echo "killing any node find or node tar commands..."
pdsh -f 500 -g compute killall find
pdsh -f 500 -g compute killall tar

# (In this test, leader1 is known to have glusterd stopped for the test case.)
echo "stop, start glusterd, drop caches, sleep 15"
set -x
pdsh -w leader2,leader3 systemctl stop glusterd
sleep 3
pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
pdsh -w leader2,leader3 systemctl start glusterd
set +x
sleep 15

echo "drop caches on nodes"
pdsh -f 500 -g compute "echo 3 > /proc/sys/vm/drop_caches"

echo "----------------------------------------------------------------------"
echo "test start"
echo "----------------------------------------------------------------------"

set -x

# NFS-intensive load: a streaming read of /usr in the background, a metadata
# walk, and repeated cluster-configuration runs across all compute nodes.
pdsh -f 500 -g compute "tar cf - /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute "find /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
wait

-------------- next part --------------
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable off
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 3
nfs.exports-auth-enable on
nfs.auth-refresh-interval-sec 360
nfs.auth-cache-ttl-sec 360
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)
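
As a companion to the test case above, here is a minimal sketch for watching the serving leader for these errors while the test runs. It assumes the default gnfs log location (/var/log/glusterfs/nfs.log) and the cm_shared volume name taken from the log excerpts; adjust both if they differ.

#!/bin/bash
# Sketch only: run on the gnfs-serving leader while the load test is going.
NFSLOG=/var/log/glusterfs/nfs.log   # assumed default gnfs log path
VOL=cm_shared                       # volume name from the log excerpts

# How many AFR read transactions have failed with "split-brain observed" so far.
grep -c 'split-brain observed' "$NFSLOG"

# Show the most recent affected gfids/paths.
grep 'afr_read_txn_refresh_done' "$NFSLOG" | tail -n 20

# Cross-check against the entries gluster still considers un-healed.
gluster volume heal "$VOL" info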
Strahil Nikolov
2020-Mar-29 21:39 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On March 29, 2020 7:10:49 AM GMT+03:00, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>Hello all,
>
>I am getting split-brain errors in the gnfs nfs.log when 1 gluster
>server is down in a 3-brick/3-node gluster volume. It only happens under
>intense load.
>
>[...]
>
>If there is anything the community can suggest to help me with this, it
>would really be appreciated. I'm getting unhappy reports from the field
>that the failover doesn't work as expected.
>
>[...]

Hey Erik,

That's odd. As far as I know, the clients are accessing one of the gluster nodes, which serves as the NFS server and then syncs the data across its peers, right?

What happens when the virtual IP(s) are failed over to the other gluster node? Is the issue resolved?

Do you get any split-brain entries via 'gluster volume heal <VOL> info'?

Also, what kind of load balancing are you using?

Best Regards,
Strahil Nikolov
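
A minimal sketch of the checks being asked about here, assuming the cm_shared volume name from the nfs.log excerpts earlier in the thread:

#!/bin/bash
VOL=cm_shared   # replace with the real volume name

# Entries with pending heals on any brick of the replica.
gluster volume heal "$VOL" info

# Only the entries gluster has actually flagged as split-brain.
gluster volume heal "$VOL" info split-brain

# Per-brick counts (supported on recent gluster releases such as 7.x).
gluster volume heal "$VOL" info summary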
Ravishankar N
2020-Mar-30 04:22 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On 29/03/20 9:40 am, Erik Jacobson wrote:
> Hello all,
>
> I am getting split-brain errors in the gnfs nfs.log when 1 gluster
> server is down in a 3-brick/3-node gluster volume. It only happens under
> intense load.
>
> In the lab, I have a test case that can repeat the problem on a single
> subvolume cluster.
>
> If all leaders are up, we see no errors.
>
> Here are example nfs.log errors:
>
> [2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
>
Since you say that the errors go away when all 3 bricks of the replica are up (which I guess is what you refer to as 'leaders'), it could be that the brick you brought down had the only good copy. In such cases, even though the other 2 bricks of the replica are up, they are both bad copies waiting to be healed, and hence all operations on those files will fail with EIO. Since you say this occurs only under high load, I suspect this is what is happening: the heal hasn't had time to catch up with the nodes going up and down.

If you see the split-brain errors despite all 3 replica bricks being online and the gnfs server being able to connect to all of them, then it could be a genuine split-brain problem. But I don't think that is the case here.

Regards,
Ravi
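
A minimal sketch of how this theory can be checked directly on the two surviving bricks; the brick path and file below are examples only, not taken from the thread. Non-zero trusted.afr.<volname>-client-N changelog xattrs on both remaining copies would mean both are pending heal and reads will be refused.

#!/bin/bash
BRICK=/data/brick/cm_shared    # hypothetical brick mount path
FILE=bin/whoami                # one of the paths from the nfs.log errors

# Dump all xattrs in hex; look for trusted.afr.<volname>-client-N entries
# with non-zero data/metadata/entry counters (a copy that blames another brick).
getfattr -d -m . -e hex "$BRICK/$FILE"

# From a FUSE mount of the volume, the virtual xattr below summarizes the
# file's split-brain/heal status (run against the file on the mount point):
# getfattr -n replica.split-brain-status /mnt/cm_shared/bin/whoami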