Did you try to tcpdump the connections to see who closes the connection,
and how? A normal FIN-ACK, or a timeout? Maybe some network device in
between? (The latter is less likely, since you said you can trigger the
error with high load.)
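
For example, something along these lines on one of the storage nodes
should show whether the teardown is a clean FIN or an abrupt RST (bond0
and the default glusterd management port 24007 are assumptions, adjust
to your setup):

  tcpdump -i bond0 -nn -ttt 'port 24007 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'

If neither flag shows up around a logged disconnect, that points to a
local timeout rather than a socket actively closed by the peer or by a
device in between.
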
<dpgluster at posteo.de> wrote on 2022-08-18 at 12:38:
> I just niced all glusterfsd processes on all nodes to a value of -10.
> The problem just occurred, so it seems nicing the processes didn't help.
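>
> (Roughly like this, for reference -- a sketch, assuming pgrep is
> available on the nodes:
>
>   for pid in $(pgrep glusterfsd); do renice -n -10 -p "$pid"; done
> )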
>
> On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:
> > What if you renice the gluster processes to some negative value?
> >
> > <dpgluster at posteo.de> wrote on 2022-08-18 at 09:45:
> >
> >> Hi folks,
> >>
> >> I am running multiple GlusterFS servers in multiple datacenters.
> >> Every
> >> datacenter is basically the same setup: 3x storage nodes, 3x kvm
> >> hypervisors (oVirt) and 2x HPE switches which are acting as one
> >> logical
> >> unit. The NICs of all servers are attached to both switches with a
> >> bonding of two NICs, in case one of the switches has a major
> >> problem.
> >> In one datacenter I have had strange problems with GlusterFS for
> >> nearly half a year now, and I'm not able to figure out the root
> >> cause.
> >>
> >> Environment
> >> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> >> - three gluster volumes, all options equally configured
> >>
> >> root at storage-001# gluster volume info
> >> Volume Name: g-volume-domain
> >> Type: Replicate
> >> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x 3 = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> >> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> >> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> >> Options Reconfigured:
> >> client.event-threads: 4
> >> performance.cache-size: 1GB
> >> server.event-threads: 4
> >> server.allow-insecure: On
> >> network.ping-timeout: 42
> >> performance.client-io-threads: off
> >> nfs.disable: on
> >> transport.address-family: inet
> >> cluster.quorum-type: auto
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.stat-prefetch: off
> >> performance.io-cache: off
> >> performance.quick-read: off
> >> cluster.data-self-heal-algorithm: diff
> >> storage.owner-uid: 36
> >> storage.owner-gid: 36
> >> performance.readdir-ahead: on
> >> performance.read-ahead: off
> >> client.ssl: off
> >> server.ssl: off
> >> auth.ssl-allow:
> >> storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> >> ssl.cipher-list: HIGH:!SSLv2
> >> cluster.shd-max-threads: 4
> >> diagnostics.latency-measurement: on
> >> diagnostics.count-fop-hits: on
> >> performance.io-thread-count: 32
> >>
> >> Problem
> >> The glusterd on one storage node seems to lose the connection to
> >> another storage node. When the problem occurs, the first message in
> >> /var/log/glusterfs/glusterd.log is always the following (variable
> >> values are replaced with "x"):
> >> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >>
> >> I will post a filtered log for this specific error on each of my
> >> storage
> >> nodes below.
> >> storage-001:
> >> root at storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log |
> >> grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> root at storage-001#
> >>
> >> storage-002:
> >> root at storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log |
> >> grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> root at storage-002#
> >>
> >> storage-003:
> >> root at storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log |
> >> grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in state <Peer in Cluster>, has disconnected from glusterd.
> >> root at storage-003#
> >>
> >> After this message it takes a couple of seconds (in the specific
> >> example of 2022-08-16 it's one to four seconds) and the disconnected
> >> node is reachable again:
> >> [2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493]
> >> [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd:
> >> Received
> >> ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host:
> >> storage-002.my.domain, port: 0
> >>
> >> This behavior is the same on all nodes - there is a disconnect of a
> >> gluster node, and a couple of seconds later the disconnected node is
> >> reachable again. After the reconnect the glustershd is invoked and
> >> heals all the data. How can I figure out the root cause of these
> >> random disconnects?
> >>
> >> My debugging actions so far:
> >> - checked dmesg -> zero messages around the time of the disconnects
> >> - checked the switch -> no port down/up, no packet errors
> >> - disabled SSL on the gluster volumes -> disconnects still occurring
> >> - checked the dropped/error packets on the network interfaces of the
> >> storage nodes -> no dropped packets, no errors
> >> - constant ping check between all nodes while a disconnect occurs ->
> >> zero packet loss, no high latencies
> >> - temporarily deactivated one of the two interfaces that make up the
> >> bond (see the bond status check sketched below this list) ->
> >> disconnects still occurring
> >> - updated gluster from 6.x to 9.5 -> disconnects still occurring
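> >>
> >> (The bond and interface counters were inspected roughly like this --
> >> bond0 is an assumed interface name, adjust to the actual one:
> >>
> >>   cat /proc/net/bonding/bond0
> >>   ip -s link show bond0
> >> )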
> >>
> >> Important info: I can force this error to happen if I put some high
> >> I/O load on one of the gluster volumes.
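> >>
> >> (For illustration -- the mount path is a placeholder -- sequential
> >> writes of a few large files onto a mounted volume generate that kind
> >> of load:
> >>
> >>   dd if=/dev/zero of=/mnt/volume/loadtest.bin bs=1M count=20000 oflag=direct
> >> )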
> >>
> >> I suspect there could be an issue with a network queue overflow or
> >> something like that, but that theory does not match the result of my
> >> ping check.
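> >>
> >> (A ping check only exercises ICMP; drops caused by full queues or
> >> socket backlogs would show up in the TCP and kernel counters instead.
> >> A rough way to look at those -- exact counter names vary by kernel,
> >> and 24007 is glusterd's default management port -- is:
> >>
> >>   netstat -s | grep -iE 'overflow|drop|retrans|prune'
> >>   ss -ti '( sport = :24007 or dport = :24007 )'
> >>   cat /proc/net/softnet_stat
> >> )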
> >>
> >> What would be your next step to debug this error?
> >>
> >> Thanks in advance!
> >> ________
> >>
> >> Community Meeting Calendar:
> >>
> >> Schedule -
> >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> >> Bridge: https://meet.google.com/cpu-eiue-hvk [1]
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users [2]
> >
> >
> > Links:
> > ------
> > [1] https://meet.google.com/cpu-eiue-hvk
> > [2] https://lists.gluster.org/mailman/listinfo/gluster-users
>