denmat
2016-Oct-30 21:32 UTC
[Gluster-users] GlusterFS CPU high on one node - SETATTR Operation not permitted
Hi all,

I have a 3-node Gluster cluster:

glusterfs 3.7.12 built on Jun 27 2016 12:40:53
3 x AWS EC2 m4.xlarge
1 x 1.5 TB st1 EBS

We have had several issues where hosts running the GlusterFS FUSE client get stuck in the 'D' (uninterruptible sleep) state when reading the file system, and one of the three nodes now carries substantially more CPU load than the other two. This morning (several hours ago) we shut down the node exhibiting the high-CPU behaviour; that node is now back to normal, but one of the other two is now showing high CPU.

The brick logs on both the 'good' nodes are emitting this over and over:

[2016-10-30 20:55:48.483045] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 676120: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:51.511782] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 695527: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:51.553858] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 708963: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/c42cc627-adc6-43a4-93dd-ea6ec0eaa9cb/source (eaceb846-9ee1-4fab-8f4d-e2a93e85710f) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:54.660251] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 709004: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:54.866259] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 356367: SETATTR /ftpdata/[REDACTED]/bulk_import (d7dda9df-bf56-4f71-b6b8-59f2ccc41016) ==> (Operation not permitted) [Operation not permitted]

There are a large number of files listed in 'gluster volume heal <vol> info', and client nodes using the GlusterFS FUSE client have been hanging in an uninterruptible state, causing problems. I can also see that files are missing on some of the bricks. To force healing, I have a client with the volume mounted via NFS doing a full traversal, stat()ing each file and directory:

    find /<mountpoints> -type f -exec stat {} \;

and I can see files being healed. We are hoping this will get us back to a normal state, but we are unsure whether a larger problem is looming.
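In case it helps anyone reproduce this, the traversal amounts to roughly the sketch below. It is a minimal sketch, not exactly what we run: the mount point /mnt/marketplace_nfs is an assumption, the volume name comes from the 0-marketplace_nfs-server prefix in the brick logs, and the brick path is guessed from the attached statedump's filename. Note also that cluster.self-heal-daemon is 'disable' in our settings (below), which is presumably why healing only happens via client-side lookups like these:

    #!/bin/sh
    # Heal-by-traversal sketch. Assumed names (not stated above):
    #   volume:      marketplace_nfs       (from the brick log prefix)
    #   mount point: /mnt/marketplace_nfs  (hypothetical)
    VOL=marketplace_nfs
    MNT=/mnt/marketplace_nfs

    # Stat every entry to trigger lookups, which let the client heal
    # entries as it goes. No -type f filter here, so directories get
    # statted as well; that may matter given the failing SETATTRs
    # above include a directory (/ftpdata/[REDACTED]/bulk_import).
    find "$MNT" -exec stat {} \; > /dev/null

    # See what is still pending heal afterwards.
    gluster volume heal "$VOL" info

On a server node, the AFR pending-heal markers on a suspect path can also be inspected directly on the brick (brick path guessed from the statedump filename):

    getfattr -d -m . -e hex /data/data0/marketplace_nfs/ftpdata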
I have included the statedump (attached: data-data0-marketplace_nfs.4868.dump.1477862666.gz) and the current settings:

Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon disable
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm full
cluster.eager-lock on
disperse.eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 512MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.least-rate-limit 0
performance.cache-size 512MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 1
performance.cache-swift-metadata true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 15
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 2
network.ping-timeout 15
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive (null)
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
features.lock-heal off
features.grace-timeout (null)
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 2
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
features.file-snapshot off
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.quota-timeout 0
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 16
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm on
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.bd-aio off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51%
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 1000
features.ctr-sql-db-wal-autocheckpoint 1000
locks.trace (null)
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
ganesha.enable off
features.shard off
features.shard-block-size 4MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
dht.force-readdirp on
disperse.read-policy round-robin
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full

Help is appreciated.

Den
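P.S. The option dump above can be reproduced with the CLI; a minimal sketch, assuming the volume name taken from the brick log prefix:

    # List every effective option for the volume (gluster volume get
    # is available from glusterfs 3.7 onwards).
    gluster volume get marketplace_nfs all

    # Individual options can be queried the same way, e.g. the
    # self-heal daemon state, which is 'disable' in the dump above:
    gluster volume get marketplace_nfs cluster.self-heal-daemon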