thr3ads.net - Gluster users - [Gluster-users] Issues with glustershd with release 8.4 and 9.1 [May 2021]


Marco Fais

2021-May-19 14:32 UTC

[Gluster-users] Issues with glustershd with release 8.4 and 9.1

Hi Ravi,

Thanks a million for your reply.

I have replicated the issue in my test cluster by bringing one of the nodes
down, and then up again.
The glustershd process in the restarted node is now complaining about
connectivity to two bricks in one of my volumes:

---
[2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
2-VM_Storage_1-client-11: changing port to 49170 (from 0)
[2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057]
[client-handshake.c:1128:select_server_supported_programs]
2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
{Num=1298437}, {Version=400}]
[2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043]
[client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11:
failed to set the volume [{errno=2}, {error=No such file or directory}]
[2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007]
[client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11:
failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
argument}]
[2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044]
[client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11:
SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
{error=No such file or directory}]
[2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051]
[client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11:
sending CHILD_CONNECTING event []
[2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018]
[client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected
from client, process will keep trying to connect glusterd until brick's
port is available [{conn-name=VM_Storage_1-client-11}]
[2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
2-VM_Storage_1-client-20: changing port to 49173 (from 0)
[2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057]
[client-handshake.c:1128:select_server_supported_programs]
2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1},
{Num=1298437}, {Version=400}]
[2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043]
[client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20:
failed to set the volume [{errno=2}, {error=No such file or directory}]
[2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007]
[client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20:
failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
argument}]
[2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044]
[client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20:
SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
{error=No such file or directory}]
[2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051]
[client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20:
sending CHILD_CONNECTING event []
[2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018]
[client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected
from client, process will keep trying to connect glusterd until brick's
port is available [{conn-name=VM_Storage_1-client-20}]
---

The two bricks are the following:
VM_Storage_1-client-20 --> Brick21:
lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
VM_Storage_1-client-11 --> Brick12:
lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
(In this case the issue is on two arbiter nodes, but that is not always
the case.)

The port information reported by "gluster volume status VM_Storage_1" on the
affected node (the same node whose glustershd is reporting the issue) is:
Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick
            49172     0          Y       3978256
Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick
            49169     0          Y       3978224

This matches the actual port of each brick process:
root     3978256  1.5  0.0 1999568 30372 ?       Ssl  May18  15:56
/usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p
/var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid
-S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name
/bricks/vm_b5_arb/brick -l
/var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option
*-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
brick --brick-port 49172 --xlator-option
VM_Storage_1-server.listen-port=49172
root     3978224  4.3  0.0 1867976 27928 ?       Ssl  May18  44:55
/usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p
/var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid
-S /var/run/gluster/00d461b7d79badc9.socket --brick-name
/bricks/vm_b3_arb/brick -l
/var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option
*-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
brick --brick-port 49169 --xlator-option
VM_Storage_1-server.listen-port=49169

So the issue seems to be specifically on glustershd, as the *glusterd
process seems to be aware of the right port* (it matches the real port,
and the brick is indeed up according to the status).
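
The comparison above (port from volume status versus the --brick-port argument in the ps output) can be automated. The following is only a minimal sketch that works on pasted text: the function names, the abbreviated sample, and the idea of keying the status ports by brick path are all assumptions made for illustration, not part of any Gluster tooling.

```python
import re

def brick_ports_from_ps(ps_text):
    """Map each --brick-name to the --brick-port its glusterfsd was started with."""
    flat = " ".join(ps_text.split())  # undo line wrapping from pasted output
    ports = {}
    # Each chunk is the argument list following one glusterfsd binary path.
    for args in re.split(r"\S*glusterfsd\s", flat)[1:]:
        name = re.search(r"--brick-name (\S+)", args)
        port = re.search(r"--brick-port (\d+)", args)
        if name and port:
            ports[name.group(1)] = int(port.group(1))
    return ports

def port_mismatches(status_ports, ps_ports):
    """Return {brick: (status_port, ps_port)} where the two sources disagree."""
    return {
        brick: (status_ports.get(brick), port)
        for brick, port in ps_ports.items()
        if status_ports.get(brick) != port
    }

# Abbreviated copy of the ps output above (hypothetical sample input).
PS_SAMPLE = """
root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56
/usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --brick-name
/bricks/vm_b5_arb/brick --process-name brick --brick-port 49172
root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55
/usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --brick-name
/bricks/vm_b3_arb/brick --process-name brick --brick-port 49169
"""

# Ports copied by hand from the volume status output above.
STATUS_PORTS = {
    "/bricks/vm_b5_arb/brick": 49172,
    "/bricks/vm_b3_arb/brick": 49169,
}

print(port_mismatches(STATUS_PORTS, brick_ports_from_ps(PS_SAMPLE)))
```

An empty result means glusterd and the running brick processes agree on the ports, which is the situation described here.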

I then took a statedump as you suggested, and the bricks do appear to be
disconnected:

[xlator.protocol.client.VM_Storage_1-client-11.priv]
*connected=0*
total_bytes_read=341120
ping_timeout=42
total_bytes_written=594008
ping_msgs_sent=0
msgs_sent=0

[xlator.protocol.client.VM_Storage_1-client-20.priv]
*connected=0*
total_bytes_read=341120
ping_timeout=42
total_bytes_written=594008
ping_msgs_sent=0
msgs_sent=0
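
For a volume with many bricks, scanning the statedump by hand is tedious. Below is a rough sketch of how the connected= flag could be pulled out per client, assuming the section headers look exactly like the two shown above; the function name and the sample text are hypothetical.

```python
import re

SECTION = re.compile(r"\[xlator\.protocol\.client\.(.+)\.priv\]")

def client_connection_state(statedump_text):
    """Return {client_name: True/False} from the connected= line of each
    [xlator.protocol.client.<name>.priv] section."""
    state = {}
    current = None
    for raw in statedump_text.splitlines():
        line = raw.strip()
        header = SECTION.fullmatch(line)
        if header:
            current = header.group(1)
        elif line.startswith("["):
            current = None  # some other section; ignore its keys
        elif current and line.startswith("connected="):
            state[current] = line.split("=", 1)[1] == "1"
    return state

# Trimmed copy of the statedump excerpt above (hypothetical sample input).
DUMP_SAMPLE = """
[xlator.protocol.client.VM_Storage_1-client-11.priv]
connected=0
ping_timeout=42

[xlator.protocol.client.VM_Storage_1-client-20.priv]
connected=0
ping_timeout=42
"""

disconnected = sorted(
    name for name, up in client_connection_state(DUMP_SAMPLE).items() if not up
)
print(disconnected)
```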

The other important thing to note is that the bricks that fail to connect
are always on the same (remote) node, i.e. they are both on node 3 in this
case. I have not encountered a scenario where bricks from different nodes
report this issue (at least for the same volume).

Please let me know if you need any additional info.

Regards,
Marco


On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishankar at redhat.com> wrote:
>
> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com> wrote:
>
>> Hi,
>>
>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>
>> My oVirt clusters are using gluster storage backends, and were running
>> fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x).
>> Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I
>> have moved to this release when upgrading my clusters.
>>
>> Since then I am having issues whenever one of the nodes is brought down;
>> when the nodes come back up online the bricks are typically back up and
>> working, but some (random) glustershd processes in the various nodes seem
>> to have issues connecting to some of them.
>>
> When the issue happens, can you check if the TCP port number of the brick
> (glusterfsd) processes displayed in `gluster volume status` matches with
> that of the actual port numbers observed (i.e. the --brick-port argument)
> when you run `ps aux | grep glusterfsd` ? If they don't match, then
> glusterd has incorrect brick port information in its memory and serving it
> to glustershd. Restarting glusterd instead of (killing the bricks + `volume
> start force`) should fix it, although we need to find why glusterd serves
> incorrect port numbers.
>
> If they do match, then can you take a statedump of glustershd to check
> that it is indeed disconnected from the bricks? You will need to verify
> that 'connected=1' in the statedump. See "Self-heal is stuck/ not getting
> completed." section in
> https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/.
> Statedump can be taken by `kill -SIGUSR1 $pid-of-glustershd`. It will be
> generated in the /var/run/gluster/ directory.
>
> Regards,
> Ravi

Ravishankar N

2021-May-20 13:01 UTC


[Gluster-users] Issues with glustershd with release 8.4 and 9.1

Hi Marco,

On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com> wrote:
> Hi Ravi,
>
> thanks a million for your reply.
>
> I have replicated the issue in my test cluster by bringing one of the
> nodes down, and then up again.
> The glustershd process in the restarted node is now complaining about
> connectivity to two bricks in one of my volumes:
>
> ---
> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
> 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057]
> [client-handshake.c:1128:select_server_supported_programs]
> 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
> {Num=1298437}, {Version=400}]
> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043]
> [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11:
> failed to set the volume [{errno=2}, {error=No such file or directory}]
> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007]
> [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11:
> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
> argument}]
> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044]
> [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11:
> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
> {error=No such file or directory}]
> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051]
> [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11:
> sending CHILD_CONNECTING event []
> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018]
> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected
> from client, process will keep trying to connect glusterd until brick's
> port is available [{conn-name=VM_Storage_1-client-11}]
> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
> 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>
The above logs indicate that shd is trying to connect to the bricks on
ports 49170 (client-11) and 49173 (client-20), when it should have used
49169 and 49172 respectively (as per the volume status and ps output). Shd
gets the brick port numbers from glusterd, so I'm not sure what is going on
here. Do you have fuse mounts on this particular node? If you don't, you can
mount the volume temporarily, then check whether the connection to the bricks
is successful from the .meta folder of the mount:

cd /path-to-fuse-mount
cd .meta
cat graphs/active/VM_Storage_1-client-11/private
cat graphs/active/VM_Storage_1-client-20/private
etc. and check if connected=1 or 0.

I just want to see whether only the shd, or the other clients as well, are
unable to connect to the bricks from this node. FWIW, I tried upgrading
from 7.9 to 8.4 on a test machine and the shd was able to connect to the
bricks just fine.
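
The port mismatch described in this message can also be read straight from the glustershd log. The sketch below assumes the "changing port to N" wording shown in the quoted log lines; the function names and the client-to-port table (filled in by hand from the brick mapping, volume status and ps output earlier in the thread) are hypothetical.

```python
import re

def attempted_ports(log_text):
    """Return {client_name: port} from 'changing port to N' log entries.
    If a client appears more than once, the last entry wins."""
    flat = " ".join(log_text.split())  # rejoin entries wrapped by the mailer
    pattern = r"\b\d+-(\S+-client-\d+): changing port to (\d+)"
    return {client: int(port) for client, port in re.findall(pattern, flat)}

def wrong_ports(attempted, actual):
    """Return {client: (attempted_port, actual_port)} where they differ."""
    return {
        client: (port, actual[client])
        for client, port in attempted.items()
        if client in actual and actual[client] != port
    }

# The two rpc_clnt_reconfig entries from the quoted log (sample input).
LOG_SAMPLE = """
[2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
2-VM_Storage_1-client-11: changing port to 49170 (from 0)
[2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig]
2-VM_Storage_1-client-20: changing port to 49173 (from 0)
"""

# Hand-built from the thread: client-11 is vm_b3_arb, client-20 is vm_b5_arb.
ACTUAL_PORTS = {
    "VM_Storage_1-client-11": 49169,
    "VM_Storage_1-client-20": 49172,
}

for client, (tried, real) in wrong_ports(attempted_ports(LOG_SAMPLE), ACTUAL_PORTS).items():
    print(f"{client}: shd tried {tried}, brick listens on {real}")
```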
Regards,
Ravi



