Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter has basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches acting as one logical unit. The NICs of all servers are attached to both switches in a two-NIC bond, in case one of the switches has a major problem. In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I am not able to figure out the root cause.

Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options configured identically

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem
The glusterd on one storage node seems to lose its connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are replaced with "x"):

[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in state <Peer in Cluster>, has disconnected from glusterd.

Below is the log filtered for this specific message on each of my storage nodes.

storage-001:
root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-001#

storage-002:
root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-002#

storage-003:
root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-003#

After this message it takes a couple of seconds (in the example from 2022-08-16 it is one to four seconds) until the disconnected node is reachable again:

[2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: storage-002.my.domain, port: 0

The behavior is the same on all nodes: a gluster node disconnects and a couple of seconds later the disconnected node is reachable again. After the reconnect the glustershd is invoked and heals all the data. How can I figure out the root cause of these random disconnects?

My debugging actions so far:
- checked dmesg -> zero messages around the time of the disconnects
- checked the switch -> no port down/up, no packet errors
- disabled SSL on the gluster volumes -> disconnects still occurring
- checked the dropped/error packet counters on the network interfaces of the storage nodes -> no dropped packets, no errors
- ran a constant ping check between all nodes while a disconnect occurred -> zero packet loss, no high latencies
- temporarily deactivated one of the two interfaces that form the bond -> disconnects still occurring
- updated gluster from 6.x to 9.5 -> disconnects still occurring

Important info: I can force this error to happen if I put some high I/O load on one of the gluster volumes.
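For illustration, the load that triggers it and the checks that could catch a TCP-level problem look roughly like this. This is only a sketch: /mnt/g-volume-domain stands in for wherever the volume is FUSE-mounted, dd is just one way of generating sustained writes, bond0 stands for whatever the bond interface is called, and 24007 is the default glusterd management port. Since an ICMP ping check cannot show TCP retransmissions or listen-queue overflows, the counters would be watched on the storage nodes while the load is running:

# on a client / hypervisor: generate sustained write load on the mounted volume
root@hv-001# dd if=/dev/zero of=/mnt/g-volume-domain/loadtest.bin bs=1M count=20000 oflag=direct

# on each storage node, repeatedly, watching the deltas between runs:
root@storage-001# nstat -az TcpRetransSegs TcpExtListenOverflows TcpExtListenDrops
root@storage-001# ss -tino '( sport = :24007 or dport = :24007 )'
root@storage-001# ip -s link show bond0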
I suspect there could be an issue with a network queue overflow or something like that, but that theory does not match the result of my ping check.

What would be your next step to debug this?

Thanks in advance!
dpgluster@posteo.de wrote on 2022-08-18 09:45:
> [...]

What if you renice the gluster processes to some negative value?
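Something along these lines, for example. The process names are the ones the CentOS glusterfs packages use (glusterd for the management daemon whose peer connections are dropping, glusterfsd for the brick processes, glusterfs for the self-heal daemon and other clients), and -5 is just an arbitrary test value:

# raise the scheduling priority of the gluster daemons (needs root, does not survive a daemon restart)
renice -n -5 -p $(pidof glusterd)
renice -n -5 -p $(pidof glusterfsd)
renice -n -5 -p $(pidof glusterfs)

If the disconnects stop after that, it would point at the daemons being starved for CPU under the I/O load rather than at the network.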