Ravishankar N
2021-May-25 16:27 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Hi Marco,

I haven't had any luck yet. Adding Mohit and Srijan who work in glusterd in case they have some inputs.
-Ravi

On Tue, May 25, 2021 at 9:31 PM Marco Fais <evilmf at gmail.com> wrote:

> Hi Ravi
>
> just wondering if you have any further thoughts on this -- unfortunately it is something still very much affecting us at the moment.
> I am trying to understand how to troubleshoot it further but haven't been able to make much progress...
>
> Thanks,
> Marco
>
> On Thu, 20 May 2021 at 19:04, Marco Fais <evilmf at gmail.com> wrote:
>
>> Just to complete...
>>
>> From the FUSE mount log on server 2 I see the same errors as in glustershd.log on node 1:
>>
>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are ready, attempting connect on transport []
>> [2021-05-20 17:58:34.160586 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port to 49170 (from 0)
>> [2021-05-20 17:58:34.160608 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional socket shutdown(20)
>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10: Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10}, {remote_subvol=/bricks/vm_b3_vol/brick}]
>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is met
>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are ready, attempting connect on transport []
>> [2021-05-20 17:58:34.162491 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port to 49170 (from 0)
>> [2021-05-20 17:58:34.162507 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional socket shutdown(26)
>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>
>> On Thu, 20 May 2021 at 18:54, Marco Fais <evilmf at gmail.com> wrote:
>>
>>> Hi Ravi,
>>>
>>> thanks again for your help.
>>> Here is the output of "cat graphs/active/VM_Storage_1-client-11/private" from the same node where glustershd is complaining:
>>>
>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>> fd.0.remote_fd = 1
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.1.remote_fd = 0
>>> ------ = ------
>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.2.remote_fd = 3
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.3.remote_fd = 2
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> connected = 1
>>> total_bytes_read = 6665235356
>>> ping_timeout = 42
>>> total_bytes_written = 4756303549
>>> ping_msgs_sent = 3662
>>> msgs_sent = 16786186
>>>
>>> So they seem to be connected there.
>>> *However* -- they are apparently not connected on server 2 (where I have just re-mounted the volume):
>>>
>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-11/private
>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>> *connected = 0*
>>> total_bytes_read = 50020
>>> ping_timeout = 42
>>> total_bytes_written = 84628
>>> ping_msgs_sent = 0
>>> msgs_sent = 0
>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-20/private
>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>> *connected = 0*
>>> total_bytes_read = 53300
>>> ping_timeout = 42
>>> total_bytes_written = 90180
>>> ping_msgs_sent = 0
>>> msgs_sent = 0
>>>
>>> The other bricks look connected...
>>>
>>> Regards,
>>> Marco
>>>
>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishankar at redhat.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>
>>>>> Hi Ravi,
>>>>>
>>>>> thanks a million for your reply.
>>>>>
>>>>> I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.
>>>>> The glustershd process in the restarted node is now complaining about connectivity to two bricks in one of my volumes:
>>>>>
>>>>> ---
>>>>> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>>>>>
>>>>
>>>> The above logs indicate that shd is trying to connect to the bricks on ports 49170 and 49173 respectively, when it should have done so using 49172 and 49169 (as per the volume status and ps output). Shd gets the brick port numbers from glusterd, so I'm not sure what is going on here. Do you have fuse mounts on this particular node?
>>>> If you don't, you can mount it temporarily, then check if the connection to the bricks is successful from the .meta folder of the mount:
>>>>
>>>> cd /path-to-fuse-mount
>>>> cd .meta
>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>
>>>> etc., and check if connected=1 or 0.
>>>>
>>>> I just wanted to see whether it is only the shd, or whether the other clients are also unable to connect to the bricks from this node. FWIW, I tried upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect to the bricks just fine.
>>>>
>>>> Regards,
>>>> Ravi
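The same check can be run over every client xlator of the mount in one go. This is only a minimal sketch based on the .meta layout shown above; it assumes the volume is FUSE-mounted at the hypothetical path /mnt/VM_Storage_1, so adjust the mount point to your setup:

  #!/bin/bash
  # Sketch: print the "connected" state of every client xlator of a FUSE mount,
  # using the virtual .meta directory of the mount.
  MOUNT=/mnt/VM_Storage_1              # hypothetical mount point -- adjust
  cd "$MOUNT/.meta/graphs/active" || exit 1
  for xl in *-client-*; do
      # each client xlator exposes its runtime state in a "private" file
      printf '%s: %s\n' "$xl" "$(grep -E '^connected' "$xl/private")"
  done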
>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event []
>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}]
>>>>> ---
>>>>>
>>>>> The two bricks are the following:
>>>>> VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>> VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>> (In this case the issue is on two arbiter bricks, but that is not always the case.)
>>>>>
>>>>> The port information via "gluster volume status VM_Storage_1" on the affected node (same as the one running the glustershd reporting the issue) is:
>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick    *49172*   0   Y   3978256
>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick    *49169*   0   Y   3978224
>>>>>
>>>>> This is aligned with the actual ports of the processes:
>>>>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49172* --xlator-option VM_Storage_1-server.listen-port=*49172*
>>>>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49169* --xlator-option VM_Storage_1-server.listen-port=*49169*
>>>>>
>>>>> So the issue seems to be specific to glustershd, as the *glusterd process seems to be aware of the right port* (it matches the real port, and the brick is indeed up according to the status).
>>>>>
>>>>> I have then requested a statedump as you suggested, and the bricks appear to be not connected:
>>>>>
>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>> *connected=0*
>>>>> total_bytes_read=341120
>>>>> ping_timeout=42
>>>>> total_bytes_written=594008
>>>>> ping_msgs_sent=0
>>>>> msgs_sent=0
>>>>>
>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>> *connected=0*
>>>>> total_bytes_read=341120
>>>>> ping_timeout=42
>>>>> total_bytes_written=594008
>>>>> ping_msgs_sent=0
>>>>> msgs_sent=0
>>>>>
>>>>> The other important thing to note is that the bricks that fail to connect are normally always on the same (remote) node -- i.e. they are both on node 3 in this case. That seems to be always the case; I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).
>>>>>
>>>>> Please let me know if you need any additional info.
>>>>>
>>>>> Regards,
>>>>> Marco
>>>>>
>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>>
>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>>>>>>
>>>>>>> My oVirt clusters are using gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>>>>>>>
>>>>>>> Since then I am having issues whenever one of the nodes is brought down; when the nodes come back online the bricks are typically back up and working, but some (random) glustershd processes in the various nodes seem to have issues connecting to some of them.
>>>>>>>
>>>>>>
>>>>>> When the issue happens, can you check if the TCP port number of the brick (glusterfsd) processes displayed in `gluster volume status` matches the actual port numbers observed (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd instead of (killing the bricks + `volume start force`) should fix it, although we need to find out why glusterd serves incorrect port numbers.
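The comparison suggested here can be scripted roughly as follows. This is a sketch only: it uses the volume name from this thread, and the /var/lib/glusterd brickinfo path is assumed to be where glusterd persists its view of the listen port, so verify both on your own installation.

  #!/bin/bash
  # Sketch: compare the brick ports advertised by glusterd with the ports the
  # local glusterfsd processes were actually started with.
  VOL=VM_Storage_1                     # volume name from this thread -- adjust

  echo "== ports according to 'gluster volume status' =="
  gluster volume status "$VOL" | grep '^Brick'

  echo "== ports according to the running brick processes =="
  ps aux | grep '[g]lusterfsd' | grep -o -e '--brick-port [0-9]*'

  echo "== ports persisted by glusterd on disk (assumed layout) =="
  grep -H '^listen-port' /var/lib/glusterd/vols/"$VOL"/bricks/* 2>/dev/null

  # If the numbers disagree, the suggestion above is to restart glusterd
  # (rather than killing the bricks + 'volume start force'):
  #   systemctl restart glusterd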
>>>>>> If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' in the statedump. See the "Self-heal is stuck / not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`; it will be generated in the /var/run/gluster/ directory.
>>>>>>
>>>>>> Regards,
>>>>>> Ravi
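The statedump check described above can be wrapped up roughly like this. A sketch, not a polished tool: it assumes a single self-heal daemon whose command line contains the string "glustershd", and a dump file named glusterdump.<pid>.dump.<timestamp> under /var/run/gluster/, both of which are worth double-checking on your version.

  #!/bin/bash
  # Sketch: trigger a statedump of glustershd and list the per-brick
  # connection flags it contains.
  SHD_PID=$(pgrep -f glustershd | head -1)   # assumes one shd process per node
  [ -n "$SHD_PID" ] || { echo "glustershd not running"; exit 1; }

  kill -SIGUSR1 "$SHD_PID"
  sleep 2                                    # give it a moment to write the dump

  DUMP=$(ls -t /var/run/gluster/glusterdump."$SHD_PID".dump.* 2>/dev/null | head -1)
  [ -n "$DUMP" ] || { echo "no statedump found in /var/run/gluster/"; exit 1; }

  # Print each client xlator section header followed by its connected flag,
  # e.g. [xlator.protocol.client.VM_Storage_1-client-11.priv] / connected=0
  grep -E '^\[xlator\.protocol\.client\..*\.priv\]|^connected=' "$DUMP"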
Marco Fais
2021-May-26 11:32 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Ravi,

thanks a million. @Mohit, @Srijan, please let me know if you need any additional information.

Thanks,
Marco

On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishankar at redhat.com> wrote:

> Hi Marco,
> I haven't had any luck yet. Adding Mohit and Srijan who work in glusterd in case they have some inputs.
> -Ravi