Erik Jacobson
2019-Sep-16 14:04 UTC
[Gluster-users] split-brain errors under heavy load when one brick down
Hello all. I'm new to the list but not to Gluster. We are using Gluster to serve NFS root filesystems on a Top500 cluster. It is a distributed-replicate volume, 3 x 3 = 9 bricks. We are having a problem when one server in a subvolume goes down: we get random missing files and split-brain errors in the nfs.log file. We are using Gluster NFS (we are interested in switching to Ganesha, but this workload presents problems there that we still need to work through).

Unfortunately, like many such large systems, I am unable to take much of the system out of production for debugging and unable to take the system down to test this very often. However, my hope is to be well prepared when the next large system comes through the factory, so I can try to reproduce this issue or have some things to try. In the lab, I have a test system that is also a 3 x 3 = 9 setup like the customer site, but with only 3 compute nodes instead of 2,592 compute nodes. We use CTDB for IP alias management - the compute nodes connect to NFS via the alias.

Here is the issue we are having:

- 2,592 nodes all PXE-booting at once and using the Gluster servers as their NFS root works great. This includes when one subvolume is degraded due to the loss of a server. No issues at boot, no split-brain messages in the log.
- The problem comes when we do an intensive job launch. This launch uses SLURM and then loads hundreds of shared libraries over NFS across all 2,592 nodes.
- When all servers in the 3 x 3 pool are up, we're in good shape - no issues on the compute nodes, no split-brain messages in the log.
- When one subvolume is missing one server (its ethernet adapters died), we still boot fine, but the SLURM launch hits random missing files. The Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes always works afterwards; the issue is transient.
- The missing file is always present on the other bricks in the subvolume when we search for it there as well.
- There are no FS/disk I/O errors in the logs or dmesg, and the files are accessible before and after the transient error (and from the bricks themselves, as noted).
- So when we are degraded, the customer's jobs fail to launch, with library read errors, missing config files, etc.

What is perplexing is that the huge load of 2,592 nodes with NFS roots PXE-booting does not trigger the issue when one subvolume is degraded.

Thank you for reading this far, and thanks to the community for making Gluster!!

Example errors:

ex1

[2019-09-06 18:26:42.665050] E [MSGID: 108008] [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed. [Input/output error]

ex2

[2019-09-06 18:26:55.359272] E [MSGID: 108008] [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed. [Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199] [nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3: /image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)

The errors seem to happen only on the 'replicate' subvolume that has one server down (of course, any of the NFS servers will trigger this when it accesses files on the degraded subvolume).
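For reference, here is roughly what I intend to capture the next time I can catch this while a brick is down. This is only a sketch: the brick-side path assumes the exported path maps straight under the brick root, and the client mount point in the loop is a placeholder, not our real path.

# Any pending heals or recorded split-brain entries on the GNFS volume?
gluster volume heal cm_shared info
gluster volume heal cm_shared info split-brain

# AFR xattrs for a file that failed, read directly from each surviving brick
# in the affected subvolume (brick-relative path is an assumption on my part):
getfattr -d -m . -e hex /data/brick_cm_shared/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32

# Crude attempt to mimic the launch pattern from a few clients while degraded:
# hammer the shared-library directory with parallel metadata reads
# (/mnt/nfsroot is a placeholder for wherever the image is mounted).
for i in $(seq 1 64); do
    find /mnt/nfsroot/usr/lib64 -name 'lib*' -exec stat {} + > /dev/null &
done
wait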
Now, I am no longer able to access this customer system and it is moving on to more secret work, so I can't easily run tests on such a big system until we have something else come through the factory. However, I'm desperate for help and would like a bag of tricks to attack this with next time I can hit it. Having the HA stuff fail when it was needed has given me a bit of a black eye on this solution. The lesson learned is to test the HA solution under every workload: I had tested full-system boots many times while degraded, but didn't think to do job launch tests while degraded. That pain will haunt me but also make me better.

Info on the volumes:

- RHEL 7.6 x86_64 Gluster/GNFS servers
- Gluster version 4.1.6 (I set up the build)
- Clients are aarch64 NFS v3 clients, technically configured with a read-only NFS root, running a version of Linux somewhat like CentOS 7.6
- The base filesystems for the bricks are XFS, with no LVM layer

What follows is the volume info from my test system in the lab, which has the same versions and setup. I cannot get this info from the customer without an approval process, but the same scripts and tools set up my test system, so I'm confident the settings are the same.

[root@leader1 ~]# gluster volume info

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: 172.23.0.6:/data/brick_cm_shared
Brick5: 172.23.0.7:/data/brick_cm_shared
Brick6: 172.23.0.8:/data/brick_cm_shared
Brick7: 172.23.0.9:/data/brick_cm_shared
Brick8: 172.23.0.10:/data/brick_cm_shared
Brick9: 172.23.0.11:/data/brick_cm_shared
Options Reconfigured:
nfs.nlm: off
nfs.mount-rmtab: /-
performance.nfs.io-cache: on
performance.md-cache-statfs: off
performance.cache-refresh-timeout: 60
storage.max-hardlinks: 0
nfs.acl: on
nfs.outstanding-rpc-limit: 1024
server.outstanding-rpc-limit: 1024
performance.write-behind-window-size: 1024MB
transport.listen-backlog: 16384
performance.write-behind-trickling-writes: off
performance.aggregate-size: 2048KB
performance.flush-behind: on
cluster.lookup-unhashed: auto
performance.parallel-readdir: on
performance.cache-size: 8GB
performance.io-thread-count: 32
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
server.event-threads: 32
client.event-threads: 32
cluster.lookup-optimize: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: false
performance.client-io-threads: on

Volume Name: ctdb
Type: Replicate
Volume ID: 5274a6ce-2ac9-4fc7-8145-dd2b8a97ff3b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 9 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_ctdb
Brick2: 172.23.0.4:/data/brick_ctdb
Brick3: 172.23.0.5:/data/brick_ctdb
Brick4: 172.23.0.6:/data/brick_ctdb
Brick5: 172.23.0.7:/data/brick_ctdb
Brick6: 172.23.0.8:/data/brick_ctdb
Brick7: 172.23.0.9:/data/brick_ctdb
Brick8: 172.23.0.10:/data/brick_ctdb
Brick9: 172.23.0.11:/data/brick_ctdb
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Here is the setting detail on the cm_shared volume - the one used for GNFS:

[root@leader1 ~]# gluster volume get cm_shared all
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open no
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 1
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.health-check-timeout 10
storage.fips-mode-rchecksum off
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
storage.ctime off
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.utime off

Erik
Ravishankar N
2019-Sep-18 04:41 UTC
[Gluster-users] split-brain errors under heavy load when one brick down
On 16/09/19 7:34 pm, Erik Jacobson wrote:

> Example errors:
>
> ex1
>
> [2019-09-06 18:26:42.665050] E [MSGID: 108008]
> [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
> ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
> [Input/output error]

Okay, so 0-cm_shared-replicate-1 means these 3 bricks:

Brick4: 172.23.0.6:/data/brick_cm_shared
Brick5: 172.23.0.7:/data/brick_cm_shared
Brick6: 172.23.0.8:/data/brick_cm_shared

> ex2
>
> [2019-09-06 18:26:55.359272] E [MSGID: 108008]
> [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
> READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
> [Input/output error]
> [2019-09-06 18:26:55.359367] W [MSGID: 112199]
> [nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
> /image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
> READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
>
> The errors seem to happen only on the 'replicate' volume where one
> server is down in the subvolume (of course, any NFS server will
> trigger that when it accesses the files on the degraded volume).

Were there any pending self-heals for this volume? Is it possible that the server that is down (one of Brick 4, 5, or 6) had the only good copy and the other two online bricks had a bad copy (needing heal)? Clients can get EIO in that case.

When you say accessing the file from the compute nodes afterwards works fine, is that still with the one server (brick) down?

There was a case of AFR reporting spurious split-brain errors, but that was fixed long back (http://review.gluster.org/16362) and the fix is present in glusterfs-4.1.6.

Side note: why are you using replica 9 for the ctdb volume? Development and testing are usually done on a (distributed) replica 3 setup.

Thanks,
Ravi
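P.S. If you can get onto the system while it is still degraded, something like the following (a rough sketch - substitute a file that actually failed and adjust the brick-relative path) run on the hosts for the two online bricks of replicate-1 would tell us whether they are holding pending heal markers:

# Dump all xattrs of the affected file as it exists on the brick and look at
# the trusted.afr.cm_shared-client-* lines:
getfattr -d -m . -e hex /data/brick_cm_shared/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32

# Non-zero counters that only blame the offline brick are expected while it is
# down; non-zero counters where the two online bricks blame each other would
# explain the EIO / split-brain messages.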