Pavel Znamensky
2020-Jun-23 16:31 UTC
[Gluster-users] One of cluster work super slow (v6.8)
Hi all,

There's something strange with one of our clusters running glusterfs version 6.8: it's quite slow and one node is overloaded. This is a distributed cluster of four servers with identical specs/OS/versions:

Volume Name: st2
Type: Distributed-Replicate
Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: st2a:/vol3/st2
Brick2: st2b:/vol3/st2
Brick3: st2c:/vol3/st2
Brick4: st2d:/vol3/st2
Options Reconfigured:
cluster.rebal-throttle: aggressive
nfs.disable: on
performance.readdir-ahead: off
transport.address-family: inet6
performance.quick-read: off
performance.cache-size: 1GB
performance.io-cache: on
performance.io-thread-count: 16
cluster.data-self-heal-algorithm: full
network.ping-timeout: 20
server.event-threads: 2
client.event-threads: 2
cluster.readdir-optimize: on
performance.read-ahead: off
performance.parallel-readdir: on
cluster.self-heal-daemon: enable
storage.health-check-timeout: 20

op.version for this cluster remains 50400

st2a is a replica of st2b, and st2c is a replica of st2d.
All 50 of our clients mount this volume via FUSE, and in contrast with our other cluster, this one works terribly slowly. The interesting thing is that HDD and network utilization are very low on the one hand, while one server is quite overloaded on the other. Also, there are no files pending heal according to `gluster volume heal st2 info`.

Load averages across the servers:
st2a: load average: 28.73, 26.39, 27.44
st2b: load average: 0.24, 0.46, 0.76
st2c: load average: 0.13, 0.20, 0.27
st2d: load average: 2.93, 2.11, 1.50

If we stop glusterfs on the st2a server, the cluster works as fast as we expect.
Previously the cluster ran version 5.x and there were no such problems.

Interestingly, almost all CPU usage on st2a comes from "system" load. The most CPU-intensive process is glusterfsd.
`top -H` for the glusterfsd process shows this:

  PID USER     PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
13894 root     20   0 2172892 96488 9056 R 74.0  0.1 122:09.14 glfs_iotwr00a
13888 root     20   0 2172892 96488 9056 R 73.7  0.1 121:38.26 glfs_iotwr004
13891 root     20   0 2172892 96488 9056 R 73.7  0.1 121:53.83 glfs_iotwr007
13920 root     20   0 2172892 96488 9056 R 73.0  0.1 122:11.27 glfs_iotwr00f
13897 root     20   0 2172892 96488 9056 R 68.3  0.1 121:09.82 glfs_iotwr00d
13896 root     20   0 2172892 96488 9056 R 68.0  0.1 122:03.99 glfs_iotwr00c
13868 root     20   0 2172892 96488 9056 R 67.7  0.1 122:42.55 glfs_iotwr000
13889 root     20   0 2172892 96488 9056 R 67.3  0.1 122:17.02 glfs_iotwr005
13887 root     20   0 2172892 96488 9056 R 67.0  0.1 122:29.88 glfs_iotwr003
13885 root     20   0 2172892 96488 9056 R 65.0  0.1 122:04.85 glfs_iotwr001
13892 root     20   0 2172892 96488 9056 R 55.0  0.1 121:15.23 glfs_iotwr008
13890 root     20   0 2172892 96488 9056 R 54.7  0.1 121:27.88 glfs_iotwr006
13895 root     20   0 2172892 96488 9056 R 54.0  0.1 121:28.35 glfs_iotwr00b
13893 root     20   0 2172892 96488 9056 R 53.0  0.1 122:23.12 glfs_iotwr009
13898 root     20   0 2172892 96488 9056 R 52.0  0.1 122:30.67 glfs_iotwr00e
13886 root     20   0 2172892 96488 9056 R 41.3  0.1 121:26.97 glfs_iotwr002
13878 root     20   0 2172892 96488 9056 S  1.0  0.1   1:20.34 glfs_rpcrqhnd
13840 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.54 glfs_epoll000
13841 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.14 glfs_epoll001
13877 root     20   0 2172892 96488 9056 S  0.3  0.1   1:20.02 glfs_rpcrqhnd
13833 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glusterfsd
13834 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.14 glfs_timer
13835 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glfs_sigwait
13836 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.16 glfs_memsweep
13837 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.05 glfs_sproc0

Also, I didn't find any relevant messages in the log files. Honestly, I don't know what to do. Does anyone know how to debug or fix this behaviour?
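One way to confirm that the busy io-threads really spend their time in the kernel ("system" CPU) rather than in user space is to read the per-thread counters from /proc. A rough sketch, not from the thread itself; `pgrep -o` and the /proc field numbers are standard Linux, but the PID selection is an assumption (adjust if several glusterfsd bricks run on the host):

```shell
# Split glusterfsd CPU time into user vs kernel time per thread.
# pgrep -o picks the oldest matching PID; adjust if needed.
PID=$(pgrep -o glusterfsd)
for t in /proc/"$PID"/task/*/stat; do
    # /proc/<pid>/task/<tid>/stat: field 2 = thread name,
    # field 14 = utime, field 15 = stime (both in clock ticks)
    awk '{printf "%-18s utime=%s stime=%s\n", $2, $14, $15}' "$t"
done
```

If stime dwarfs utime for the glfs_iotwr* threads, the bottleneck is in kernel-side work (syscalls, locking, memory management) rather than in Gluster's own user-space code.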
Best regards,
Pavel
Strahil Nikolov
2020-Jun-23 17:42 UTC
[Gluster-users] One of cluster work super slow (v6.8)
What is the OS and its version? I have seen similar behaviour (with a different workload) on RHEL 7.6 (and below).

Have you checked which processes are in 'R' or 'D' state on st2a?

Best Regards,
Strahil Nikolov

On 23 Jun 2020 at 19:31:12 GMT+03:00, Pavel Znamensky <kompastver at gmail.com> wrote:
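Checking for 'R' and 'D' state processes, as suggested above, can be done in one line with `ps`. A generic sketch (not specific to Gluster); `-eL` lists every thread, which matters here since glusterfsd's io-threads, not the main process, are the busy ones:

```shell
# List threads that are runnable (R) or in uninterruptible sleep (D).
# Many D-state threads usually point at storage/kernel stalls;
# many R-state threads point at CPU spin, as seen on st2a.
ps -eLo state,pid,tid,comm | awk 'NR == 1 || $1 ~ /^[RD]/'
```

The header row is kept so the columns stay labelled; on a healthy idle box the list below it should be nearly empty.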
Pavel Znamensky
2020-Jul-16 08:19 UTC
[Gluster-users] One of cluster work super slow (v6.8)
Sorry for the delay; somehow Gmail decided to put almost all email from this list into spam.

Anyway, yes, I checked the processes: the Gluster processes are in 'R' state, the others in 'S' state. You can find the `top -H` output in the first message.

We're running glusterfs 6.8 on CentOS 7.8, Linux kernel 4.19.

Thanks.

On Tue, 23 Jun 2020 at 21:49, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> What is the OS and its version?
> I have seen similar behaviour (different workload) on RHEL 7.6 (and below).
>
> Have you checked what processes are in 'R' or 'D' state on st2a?
>
> Best Regards,
> Strahil Nikolov
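Since the io-threads on st2a burn CPU while the disks and network stay idle, per-brick FOP statistics could show which operations they are actually busy with. A sketch using Gluster's built-in profiler (the volume name `st2` comes from this thread; the 60-second sampling window is an arbitrary choice):

```shell
# Enable per-brick latency/FOP counters on the st2 volume,
# collect for a while, then dump the stats and disable profiling.
gluster volume profile st2 start
sleep 60
gluster volume profile st2 info > /tmp/st2-profile.txt
gluster volume profile st2 stop
```

If one brick shows far higher call counts or latencies for a particular FOP (e.g. LOOKUP or READDIRP) than its peers, that narrows down what the overloaded node is spinning on.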