Pavel Znamensky
2020-Jun-23 16:31 UTC
[Gluster-users] One of cluster work super slow (v6.8)
Hi all,

There's something strange with one of our clusters running glusterfs version 6.8: it's quite slow and one node is overloaded. This is a distributed cluster of four servers with identical specs/OS/versions:

Volume Name: st2
Type: Distributed-Replicate
Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: st2a:/vol3/st2
Brick2: st2b:/vol3/st2
Brick3: st2c:/vol3/st2
Brick4: st2d:/vol3/st2
Options Reconfigured:
cluster.rebal-throttle: aggressive
nfs.disable: on
performance.readdir-ahead: off
transport.address-family: inet6
performance.quick-read: off
performance.cache-size: 1GB
performance.io-cache: on
performance.io-thread-count: 16
cluster.data-self-heal-algorithm: full
network.ping-timeout: 20
server.event-threads: 2
client.event-threads: 2
cluster.readdir-optimize: on
performance.read-ahead: off
performance.parallel-readdir: on
cluster.self-heal-daemon: enable
storage.health-check-timeout: 20

op.version for this cluster remains 50400

st2a is a replica of st2b, and st2c is a replica of st2d.
All 50 of our clients mount this volume via FUSE, and in contrast with our other cluster, this one works terribly slowly. The interesting thing is that HDD and network utilization are very low on the one hand, while one server is quite overloaded on the other. Also, there are no files pending heal according to `gluster volume heal st2 info`.

Load averages across the servers:
st2a: load average: 28.73, 26.39, 27.44
st2b: load average: 0.24, 0.46, 0.76
st2c: load average: 0.13, 0.20, 0.27
st2d: load average: 2.93, 2.11, 1.50

If we stop glusterfs on the st2a server, the cluster works as fast as we expect.
Previously the cluster ran version 5.x and there were no such problems.

Interestingly, almost all CPU usage on st2a comes from "system" load. The most CPU-intensive process is glusterfsd.
`top -H` for the glusterfsd process shows this:

  PID USER     PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
13894 root     20   0 2172892 96488 9056 R 74.0  0.1 122:09.14 glfs_iotwr00a
13888 root     20   0 2172892 96488 9056 R 73.7  0.1 121:38.26 glfs_iotwr004
13891 root     20   0 2172892 96488 9056 R 73.7  0.1 121:53.83 glfs_iotwr007
13920 root     20   0 2172892 96488 9056 R 73.0  0.1 122:11.27 glfs_iotwr00f
13897 root     20   0 2172892 96488 9056 R 68.3  0.1 121:09.82 glfs_iotwr00d
13896 root     20   0 2172892 96488 9056 R 68.0  0.1 122:03.99 glfs_iotwr00c
13868 root     20   0 2172892 96488 9056 R 67.7  0.1 122:42.55 glfs_iotwr000
13889 root     20   0 2172892 96488 9056 R 67.3  0.1 122:17.02 glfs_iotwr005
13887 root     20   0 2172892 96488 9056 R 67.0  0.1 122:29.88 glfs_iotwr003
13885 root     20   0 2172892 96488 9056 R 65.0  0.1 122:04.85 glfs_iotwr001
13892 root     20   0 2172892 96488 9056 R 55.0  0.1 121:15.23 glfs_iotwr008
13890 root     20   0 2172892 96488 9056 R 54.7  0.1 121:27.88 glfs_iotwr006
13895 root     20   0 2172892 96488 9056 R 54.0  0.1 121:28.35 glfs_iotwr00b
13893 root     20   0 2172892 96488 9056 R 53.0  0.1 122:23.12 glfs_iotwr009
13898 root     20   0 2172892 96488 9056 R 52.0  0.1 122:30.67 glfs_iotwr00e
13886 root     20   0 2172892 96488 9056 R 41.3  0.1 121:26.97 glfs_iotwr002
13878 root     20   0 2172892 96488 9056 S  1.0  0.1   1:20.34 glfs_rpcrqhnd
13840 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.54 glfs_epoll000
13841 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.14 glfs_epoll001
13877 root     20   0 2172892 96488 9056 S  0.3  0.1   1:20.02 glfs_rpcrqhnd
13833 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glusterfsd
13834 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.14 glfs_timer
13835 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glfs_sigwait
13836 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.16 glfs_memsweep
13837 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.05 glfs_sproc0

Also, I didn't find any relevant messages in the log files. Honestly, I don't know what to do. Does anyone know how to debug or fix this behaviour?
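One way to confirm that the busy io-threads really spend their time in the kernel ("system" CPU) rather than in user space is to read the per-thread counters from /proc. A rough sketch, not from the thread itself; `pgrep -o` and the /proc field numbers are standard Linux, but the PID selection is an assumption (adjust if several glusterfsd bricks run on the host):

```shell
# Split glusterfsd CPU time into user vs kernel time per thread.
# pgrep -o picks the oldest matching PID; adjust if needed.
PID=$(pgrep -o glusterfsd)
for t in /proc/"$PID"/task/*/stat; do
    # /proc/<pid>/task/<tid>/stat: field 2 = thread name,
    # field 14 = utime, field 15 = stime (both in clock ticks)
    awk '{printf "%-18s utime=%s stime=%s\n", $2, $14, $15}' "$t"
done
```

If stime dwarfs utime for the glfs_iotwr* threads, the bottleneck is in kernel-side work (syscalls, locking, memory management) rather than in Gluster's own user-space code.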
Best regards,
Pavel
Strahil Nikolov
2020-Jun-23 17:42 UTC
[Gluster-users] One of cluster work super slow (v6.8)
What is the OS and its version? I have seen similar behaviour (with a different workload) on RHEL 7.6 (and below).

Have you checked which processes are in 'R' or 'D' state on st2a?

Best Regards,
Strahil Nikolov

On 23 Jun 2020 at 19:31:12 GMT+03:00, Pavel Znamensky <kompastver at gmail.com> wrote:
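Checking for 'R' and 'D' state processes, as suggested above, can be done in one line with `ps`. A generic sketch (not specific to Gluster); `-eL` lists every thread, which matters here since glusterfsd's io-threads, not the main process, are the busy ones:

```shell
# List threads that are runnable (R) or in uninterruptible sleep (D).
# Many D-state threads usually point at storage/kernel stalls;
# many R-state threads point at CPU spin, as seen on st2a.
ps -eLo state,pid,tid,comm | awk 'NR == 1 || $1 ~ /^[RD]/'
```

The header row is kept so the columns stay labelled; on a healthy idle box the list below it should be nearly empty.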
Pavel Znamensky
2020-Jul-16 08:19 UTC
[Gluster-users] One of cluster work super slow (v6.8)
Sorry for the delay; somehow Gmail decided to put almost all email from this list into spam.

Anyway, yes, I checked the processes: the Gluster processes are in 'R' state, the others in 'S' state. You can find the `top -H` output in the first message.

We're running glusterfs 6.8 on CentOS 7.8, Linux kernel 4.19.

Thanks.

On Tue, 23 Jun 2020 at 21:49, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> What is the OS and its version?
> I have seen similar behaviour (different workload) on RHEL 7.6 (and below).
>
> Have you checked what processes are in 'R' or 'D' state on st2a?
>
> Best Regards,
> Strahil Nikolov
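Since the io-threads on st2a burn CPU while the disks and network stay idle, per-brick FOP statistics could show which operations they are actually busy with. A sketch using Gluster's built-in profiler (the volume name `st2` comes from this thread; the 60-second sampling window is an arbitrary choice):

```shell
# Enable per-brick latency/FOP counters on the st2 volume,
# collect for a while, then dump the stats and disable profiling.
gluster volume profile st2 start
sleep 60
gluster volume profile st2 info > /tmp/st2-profile.txt
gluster volume profile st2 stop
```

If one brick shows far higher call counts or latencies for a particular FOP (e.g. LOOKUP or READDIRP) than its peers, that narrows down what the overloaded node is spinning on.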