Hi David,

On Dec 24, 2019 02:47, David Cunningham <dcunningham at voisonics.com> wrote:
>
> Hello,
>
> In testing we found that actually the GFS client having access to all 3 nodes made no difference to performance. Perhaps that's because the 3rd node that wasn't accessible from the client before was the arbiter node?

It makes sense, as no data is being generated towards the arbiter.

> Presumably we shouldn't have an arbiter node listed under backupvolfile-server when mounting the filesystem? Since it doesn't store all the data, surely it can't be used to serve the data.

I have my arbiter defined as the last backup and no issues so far. At least the admin can easily identify the bricks from the mount options.

> We did have direct-io-mode=disable already as well, so that wasn't a factor in the performance problems.

Have you checked whether the client version is not too old?
You can also check the cluster's operation version:
# gluster volume get all cluster.max-op-version
# gluster volume get all cluster.op-version

The cluster's op-version should be at the max-op-version.

Two options come to mind:
A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and then set the op-version to the highest possible:
# gluster volume get all cluster.max-op-version
# gluster volume get all cluster.op-version

B) Deploy an NFS-Ganesha server and connect the client over NFS v4.2 (and control the parallel connections from Ganesha).

Can you provide your Gluster volume's options?
'gluster volume get <VOLNAME> all'

> Thanks again for any advice.
>
>
> On Mon, 23 Dec 2019 at 13:09, David Cunningham <dcunningham at voisonics.com> wrote:
>>
>> Hi Strahil,
>>
>> Thanks for that. We do have one backup server specified, but will add the second backup as well.
>>
>>
>> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg at yahoo.com> wrote:
>>>
>>> Hi David,
>>>
>>> Also consider using the mount option to specify backup servers via 'backupvolfile-server=server2:server3' (you can define more, but I don't think replica volumes greater than 3 are useful, except maybe in some special cases).
>>>
>>> In such a way, when the primary is lost, your client can reach a backup one without disruption.
>>>
>>> P.S.: The client may 'hang' - if the primary server was rebooted ungracefully - as the communication must time out before FUSE addresses the next server. There is a special script for killing gluster processes in '/usr/share/gluster/scripts' which can be used for setting up a systemd service to do that for you on shutdown.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Dec 20, 2019 23:49, David Cunningham <dcunningham at voisonics.com> wrote:
>>>>
>>>> Hi Strahil,
>>>>
>>>> Ah, that is an important point. One of the nodes is not accessible from the client, and we assumed that it only needed to reach the GFS node that was mounted, so didn't think anything of it.
>>>>
>>>> We will try making all nodes accessible, as well as "direct-io-mode=disable".
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>>>>>
>>>>> Actually I haven't clarified myself.
>>>>> FUSE mounts on the client side connect directly to all the bricks that make up the volume.
>>>>> If for some reason (bad routing, a firewall block) the client can reach only 2 out of 3 bricks, this can constantly cause healing to happen (as one of the bricks is never updated), which will degrade performance and cause excessive network usage.
>>>>> As your attachment is from one of the gluster nodes, this could be the case.
>>>>>
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>>>>
>>>>> On Friday, 20 December 2019 at 01:49:56 GMT+2, David Cunningham <dcunningham at voisonics.com> wrote:
>>>>>
>>>>>
>>>>> Hi Strahil,
>>>>>
>>>>> The chart attached to my original email is taken from the GFS server.
>>>>>
>>>>> I'm not sure what you mean by accessing all bricks simultaneously. We've mounted it from the client like this:
>>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10 0 0
>>>>>
>>>>> Should we do something different to access all bricks simultaneously?
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>>
>>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>>>>>>
>>>>>> I'm not sure if you measured the traffic from the client side (tcpdump on a client machine) or from the server side.
>>>>>>
>>>>>> In both cases, please verify that the client accesses all bricks simultaneously, as this can cause unnecessary heals.
>>>>>>
>>>>>> Have you thought about upgrading to v6? There are some enhancements in v6 which could be beneficial.
>>>>>>
>>>>>> Yet, it is indeed strange that so much traffic is generated with FUSE.
>>>>>>
>>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS and can natively speak with Gluster, which can bring you closer to the previous setup and also provide some extra performance.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>>>>>>
>>>>>>
>>>>>>
>>
>>
>> --
>> David Cunningham, Voisonics Limited
>> http://voisonics.com/
>> USA: +1 213 221 1092
>> New Zealand: +64 (0)28 2558 3782
>
>
>
> --
> David Cunningham, Voisonics Limited
> http://voisonics.com/
> USA: +1 213 221 1092
> New Zealand: +64 (0)28 2558 3782

Best Regards,
Strahil Nikolov
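For reference, a minimal sketch of the two suggestions above, with all placeholders clearly marked.

If the upgrade route (option A) is taken, the op-version is raised with a single command once every node runs the new version; the number below is a placeholder and should be whatever cluster.max-op-version reports after the upgrade (for example 70000 on Gluster 7.0):

# gluster volume get all cluster.max-op-version
# gluster volume set all cluster.op-version 70000

For option B, an NFS-Ganesha export of the same volume could look roughly like the block below. The volume name 'gvol0' is taken from the fstab line quoted earlier in the thread; Export_Id, the paths, Hostname and the access/security settings are illustrative assumptions and would need to be adapted:

EXPORT {
    Export_Id = 1;
    Path = "/gvol0";
    Pseudo = "/gvol0";
    Access_Type = RW;
    Squash = No_root_squash;
    SecType = "sys";
    FSAL {
        Name = "GLUSTER";
        Hostname = "localhost";
        Volume = "gvol0";
    }
}

A client would then mount it with something like 'mount -t nfs -o vers=4.2 ganesha-host:/gvol0 /mnt/glusterfs', where ganesha-host is whichever node runs the Ganesha service.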
David Cunningham
2019-Dec-27 01:22 UTC
[Gluster-users] GFS performance under heavy traffic
Hi Strahil,

Our volume options are as below. Thanks for the suggestion to upgrade to version 6 or 7. We could do that by simply removing the current installation and installing the new one (since it's not live right now). We might have to convince the customer that it's likely to succeed though, as at the moment I think they believe that GFS is not going to work for them.

Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 32MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 128MB
performance.qr-cache-timeout 1
performance.cache-invalidation false
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.aggregate-size 128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 1
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
network.remote-dio disable
client.event-threads 2
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 1
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 1024
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation false
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.health-check-timeout 10
storage.fips-mode-rchecksum off
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 100
storage.ctime off
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs on
features.cloudsync off
features.utime off
ctime.noatime on
feature.cloudsync-storetype (null)

Thanks again.

On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86_bg at yahoo.com> wrote:

> Hi David,
>
> On Dec 24, 2019 02:47, David Cunningham <dcunningham at voisonics.com> wrote:
> >
> > Hello,
> >
> > In testing we found that actually the GFS client having access to all 3 nodes made no difference to performance. Perhaps that's because the 3rd node that wasn't accessible from the client before was the arbiter node?
>
> It makes sense, as no data is being generated towards the arbiter.
>
> > Presumably we shouldn't have an arbiter node listed under backupvolfile-server when mounting the filesystem? Since it doesn't store all the data, surely it can't be used to serve the data.
>
> I have my arbiter defined as the last backup and no issues so far. At least the admin can easily identify the bricks from the mount options.
>
> > We did have direct-io-mode=disable already as well, so that wasn't a factor in the performance problems.
>
> Have you checked whether the client version is not too old?
> You can also check the cluster's operation version:
> # gluster volume get all cluster.max-op-version
> # gluster volume get all cluster.op-version
>
> The cluster's op-version should be at the max-op-version.
>
> Two options come to mind:
> A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and then set the op-version to the highest possible:
> # gluster volume get all cluster.max-op-version
> # gluster volume get all cluster.op-version
>
> B) Deploy an NFS-Ganesha server and connect the client over NFS v4.2 (and control the parallel connections from Ganesha).
>
> Can you provide your Gluster volume's options?
> 'gluster volume get <VOLNAME> all'
>
> > Thanks again for any advice.
> >
> >
> > On Mon, 23 Dec 2019 at 13:09, David Cunningham <dcunningham at voisonics.com> wrote:
> >>
> >> Hi Strahil,
> >>
> >> Thanks for that. We do have one backup server specified, but will add the second backup as well.
> >>
> >>
> >> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg at yahoo.com> wrote:
> >>>
> >>> Hi David,
> >>>
> >>> Also consider using the mount option to specify backup servers via 'backupvolfile-server=server2:server3' (you can define more, but I don't think replica volumes greater than 3 are useful, except maybe in some special cases).
> >>>
> >>> In such a way, when the primary is lost, your client can reach a backup one without disruption.
> >>>
> >>> P.S.: The client may 'hang' - if the primary server was rebooted ungracefully - as the communication must time out before FUSE addresses the next server. There is a special script for killing gluster processes in '/usr/share/gluster/scripts' which can be used for setting up a systemd service to do that for you on shutdown.
> >>>
> >>> Best Regards,
> >>> Strahil Nikolov
> >>>
> >>> On Dec 20, 2019 23:49, David Cunningham <dcunningham at voisonics.com> wrote:
> >>>>
> >>>> Hi Strahil,
> >>>>
> >>>> Ah, that is an important point. One of the nodes is not accessible from the client, and we assumed that it only needed to reach the GFS node that was mounted, so didn't think anything of it.
> >>>>
> >>>> We will try making all nodes accessible, as well as "direct-io-mode=disable".
> >>>>
> >>>> Thank you.
> >>>>
> >>>>
> >>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> >>>>>
> >>>>> Actually I haven't clarified myself.
> >>>>> FUSE mounts on the client side connect directly to all the bricks that make up the volume.
> >>>>> If for some reason (bad routing, a firewall block) the client can reach only 2 out of 3 bricks, this can constantly cause healing to happen (as one of the bricks is never updated), which will degrade performance and cause excessive network usage.
> >>>>> As your attachment is from one of the gluster nodes, this could be the case.
> >>>>>
> >>>>> Best Regards,
> >>>>> Strahil Nikolov
> >>>>>
> >>>>> On Friday, 20 December 2019 at 01:49:56 GMT+2, David Cunningham <dcunningham at voisonics.com> wrote:
> >>>>>
> >>>>>
> >>>>> Hi Strahil,
> >>>>>
> >>>>> The chart attached to my original email is taken from the GFS server.
> >>>>>
> >>>>> I'm not sure what you mean by accessing all bricks simultaneously. We've mounted it from the client like this:
> >>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10 0 0
> >>>>>
> >>>>> Should we do something different to access all bricks simultaneously?
> >>>>>
> >>>>> Thanks for your help!
> >>>>>
> >>>>>
> >>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> >>>>>>
> >>>>>> I'm not sure if you measured the traffic from the client side (tcpdump on a client machine) or from the server side.
> >>>>>>
> >>>>>> In both cases, please verify that the client accesses all bricks simultaneously, as this can cause unnecessary heals.
> >>>>>>
> >>>>>> Have you thought about upgrading to v6? There are some enhancements in v6 which could be beneficial.
> >>>>>>
> >>>>>> Yet, it is indeed strange that so much traffic is generated with FUSE.
> >>>>>>
> >>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS and can natively speak with Gluster, which can bring you closer to the previous setup and also provide some extra performance.
> >>>>>>
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Strahil Nikolov
> >>>>>>
> >>>>>>
> >>>>>>
> >>
> >>
> >> --
> >> David Cunningham, Voisonics Limited
> >> http://voisonics.com/
> >> USA: +1 213 221 1092
> >> New Zealand: +64 (0)28 2558 3782
> >
> >
> >
> > --
> > David Cunningham, Voisonics Limited
> > http://voisonics.com/
> > USA: +1 213 221 1092
> > New Zealand: +64 (0)28 2558 3782
>
> Best Regards,
> Strahil Nikolov
>

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
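As a sanity check against the earlier theory that a brick unreachable from the client keeps triggering self-heals (and hence the extra traffic), the following commands can show whether every brick sees the FUSE client and whether heal counts keep growing. This is a minimal sketch run on any gluster node; the volume name gvol0 is assumed from the fstab line quoted above:

# gluster volume status gvol0 clients
# gluster volume heal gvol0 info
# gluster volume heal gvol0 statistics heal-count

If the client's address is missing from one brick's client list, or the heal counts rise continuously while the volume is in use, the excess traffic is most likely self-heal rather than normal I/O.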