Marco Fais
2021-May-31 10:55 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Srijan,

no problem at all -- thanks for your help. If you need any additional information please let me know.

Regards,
Marco

On Thu, 27 May 2021 at 18:39, Srijan Sivakumar <ssivakum at redhat.com> wrote:

> Hi Marco,
>
> Thank you for opening the issue. I'll check the log contents and get back to you.
>
> On Thu, May 27, 2021 at 10:50 PM Marco Fais <evilmf at gmail.com> wrote:
>
>> Srijan,
>>
>> thanks a million -- I have opened the issue as requested here:
>>
>> https://github.com/gluster/glusterfs/issues/2492
>>
>> I have attached the glusterd.log and glustershd.log files, but please let me know if there is any other test I should do or logs I should provide.
>>
>> Thanks,
>> Marco
>>
>> On Wed, 26 May 2021 at 18:09, Srijan Sivakumar <ssivakum at redhat.com> wrote:
>>
>>> Hi Marco,
>>>
>>> If possible, let's open an issue in github and track this from there. I am checking the previous mails in the chain to see if I can infer something about the situation. It would be helpful if we could analyze this with the help of log files, especially glusterd.log and glustershd.log.
>>>
>>> To open an issue, you can use this link: Open a new issue <https://github.com/gluster/glusterfs/issues/new>
>>>
>>> On Wed, May 26, 2021 at 5:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>>
>>>> Ravi,
>>>>
>>>> thanks a million.
>>>> @Mohit, @Srijan please let me know if you need any additional information.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>
>>>>> Hi Marco,
>>>>> I haven't had any luck yet. Adding Mohit and Srijan, who work on glusterd, in case they have some inputs.
>>>>> -Ravi
>>>>>
>>>>> On Tue, May 25, 2021 at 9:31 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>
>>>>>> Hi Ravi,
>>>>>>
>>>>>> just wondering if you have any further thoughts on this -- unfortunately it is something still very much affecting us at the moment.
>>>>>> I am trying to understand how to troubleshoot it further but haven't been able to make much progress...
>>>>>>
>>>>>> Thanks,
>>>>>> Marco
>>>>>>
>>>>>> On Thu, 20 May 2021 at 19:04, Marco Fais <evilmf at gmail.com> wrote:
>>>>>>
>>>>>>> Just to complete...
>>>>>>>
>>>>>>> From the FUSE mount log on server 2 I see the same errors as in glustershd.log on node 1:
>>>>>>>
>>>>>>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are ready, attempting connect on transport []
>>>>>>> [2021-05-20 17:58:34.160586 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>>>> [2021-05-20 17:58:34.160608 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional socket shutdown(20)
>>>>>>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10: Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10}, {remote_subvol=/bricks/vm_b3_vol/brick}]
>>>>>>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is met
>>>>>>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are ready, attempting connect on transport []
>>>>>>> [2021-05-20 17:58:34.162491 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port to 49170 (from 0)
>>>>>>> [2021-05-20 17:58:34.162507 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional socket shutdown(26)
>>>>>>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>>>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>>>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>>>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>>>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>
>>>>>>> On Thu, 20 May 2021 at 18:54, Marco Fais <evilmf at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ravi,
>>>>>>>>
>>>>>>>> thanks again for your help.
>>>>>>>> Here is the output of "cat graphs/active/VM_Storage_1-client-11/private" from the same node where glustershd is complaining:
>>>>>>>>
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>> fd.0.remote_fd = 1
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type F_RDLCK, l_start = 101, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type F_RDLCK, l_start = 103, l_len = 1
>>>>>>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.1.remote_fd = 0
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.2.remote_fd = 3
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.3.remote_fd = 2
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type F_RDLCK, l_start = 101, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type F_RDLCK, l_start = 103, l_len = 1
>>>>>>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> connected = 1
>>>>>>>> total_bytes_read = 6665235356
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 4756303549
>>>>>>>> ping_msgs_sent = 3662
>>>>>>>> msgs_sent = 16786186
>>>>>>>>
>>>>>>>> So they seem to be connected there.
>>>>>>>> *However* -- they are not connected apparently in server 2 (where I have just re-mounted the volume):
>>>>>>>>
>>>>>>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-11/private
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>> *connected = 0*
>>>>>>>> total_bytes_read = 50020
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 84628
>>>>>>>> ping_msgs_sent = 0
>>>>>>>> msgs_sent = 0
>>>>>>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-20/private
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>> *connected = 0*
>>>>>>>> total_bytes_read = 53300
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 90180
>>>>>>>> ping_msgs_sent = 0
>>>>>>>> msgs_sent = 0
>>>>>>>>
>>>>>>>> The other bricks look connected...
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Marco,
>>>>>>>>>
>>>>>>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ravi,
>>>>>>>>>>
>>>>>>>>>> thanks a million for your reply.
>>>>>>>>>>
>>>>>>>>>> I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.
>>>>>>>>>> The glustershd process in the restarted node is now complaining about connectivity to two bricks in one of my volumes:
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>>>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>>>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>>>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>>>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>>>> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>>>>>>>>>>
>>>>>>>>> The above logs indicate that shd is trying to connect to the bricks on ports 49170 and 49173 respectively, when it should have done so using 49172 and 49169 (as per the volume status and ps output). Shd gets the brick port numbers info from glusterd, so I'm not sure what is going on here. Do you have fuse mounts on this particular node? If you don't, you can mount it temporarily, then check if the connection to the bricks is successful from the .meta folder of the mount:
>>>>>>>>>
>>>>>>>>> cd /path-to-fuse-mount
>>>>>>>>> cd .meta
>>>>>>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>>>>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>>>>>>
>>>>>>>>> etc., and check if connected=1 or 0.
>>>>>>>>>
>>>>>>>>> I just wanted to see if it is only the shd or even the other clients are unable to connect to the bricks from this node. FWIW, I tried upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect to the bricks just fine.
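[Editor's note: the per-client check suggested above can be scripted so every client xlator on the mount is inspected in one pass. This is a hedged sketch: the mount point and the `list_disconnected` helper are illustrative, not part of GlusterFS.]

```shell
#!/bin/sh
# Sketch: walk a FUSE mount's .meta graph and report any client xlator
# whose "private" file shows connected = 0.
# MNT is a placeholder -- point it at your own fuse mount.
MNT=${MNT:-/mnt/vm_storage_1}

list_disconnected() {
    for f in "$MNT"/.meta/graphs/active/*-client-*/private; do
        [ -r "$f" ] || continue
        # Each private file contains a line like "connected = 1"
        state=$(awk -F' *= *' '$1 == "connected" {print $2}' "$f")
        if [ "$state" = "0" ]; then
            echo "$(basename "$(dirname "$f")") is NOT connected"
        fi
    done
    return 0
}

list_disconnected
```

Run from any node with the volume fuse-mounted; an empty output would mean every client translator on that mount reports connected = 1.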
>>>>>>>>> Regards,
>>>>>>>>> Ravi
>>>>>>>>>
>>>>>>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>>>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>>>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event []
>>>>>>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}]
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> The two bricks are the following:
>>>>>>>>>> VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>>>>>>> VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>>>>>>> (In this case the issue is on two arbiter bricks, but that is not always the case.)
>>>>>>>>>>
>>>>>>>>>> The port information via "gluster volume status VM_Storage_1" on the affected node (the same one running the glustershd that reports the issue) is:
>>>>>>>>>>
>>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick    *49172*    0    Y    3978256
>>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick    *49169*    0    Y    3978224
>>>>>>>>>>
>>>>>>>>>> This is aligned with the actual ports of the processes:
>>>>>>>>>>
>>>>>>>>>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49172* --xlator-option VM_Storage_1-server.listen-port=*49172*
>>>>>>>>>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49169* --xlator-option VM_Storage_1-server.listen-port=*49169*
>>>>>>>>>>
>>>>>>>>>> So the issue seems to be specifically in glustershd, as the *glusterd process seems to be aware of the right port* (it matches the real port, and the brick is indeed up according to the status).
>>>>>>>>>>
>>>>>>>>>> I have then requested a statedump as you suggested, and the bricks seem to be not connected:
>>>>>>>>>>
>>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>>>> *connected=0*
>>>>>>>>>> total_bytes_read=341120
>>>>>>>>>> ping_timeout=42
>>>>>>>>>> total_bytes_written=594008
>>>>>>>>>> ping_msgs_sent=0
>>>>>>>>>> msgs_sent=0
>>>>>>>>>>
>>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>>>> *connected=0*
>>>>>>>>>> total_bytes_read=341120
>>>>>>>>>> ping_timeout=42
>>>>>>>>>> total_bytes_written=594008
>>>>>>>>>> ping_msgs_sent=0
>>>>>>>>>> msgs_sent=0
>>>>>>>>>>
>>>>>>>>>> The other important thing to note is that the bricks that fail to connect are always in the same (remote) node -- i.e. both are in node 3 in this case. That seems to always be the case; I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).
>>>>>>>>>>
>>>>>>>>>> Please let me know if you need any additional info.
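[Editor's note: the cross-check performed above -- the port advertised by "gluster volume status" must equal the --brick-port the glusterfsd process was started with -- can be sketched as below. `brick_port_from_cmdline` and `check_port` are illustrative helpers, not GlusterFS tools; on a live node the second argument would come from `ps aux | grep glusterfsd`.]

```shell
#!/bin/sh
# Sketch: compare a brick port reported by 'gluster volume status'
# against the --brick-port found in the glusterfsd command line.

brick_port_from_cmdline() {
    # Extract the value following --brick-port from a glusterfsd command line.
    printf '%s\n' "$1" | sed -n 's/.*--brick-port \([0-9][0-9]*\).*/\1/p'
}

check_port() {
    # $1 = port from 'gluster volume status', $2 = full glusterfsd cmdline
    actual=$(brick_port_from_cmdline "$2")
    if [ "$1" = "$actual" ]; then
        echo "OK: status and process agree on port $1"
    else
        echo "MISMATCH: status says $1 but process listens on $actual"
    fi
}

# Example with the values from this thread (ports agree, as Marco observed):
check_port 49172 "/usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --brick-port 49172 --xlator-option VM_Storage_1-server.listen-port=49172"
```

When the two values disagree, the thread's earlier advice applies: restart glusterd rather than the bricks, since glusterd is the one holding stale port information.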
>>>>>>>>>> Regards,
>>>>>>>>>> Marco
>>>>>>>>>>
>>>>>>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>>>>>>>>>>>
>>>>>>>>>>>> My oVirt clusters are using gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>>>>>>>>>>>>
>>>>>>>>>>>> Since then I am having issues whenever one of the nodes is brought down; when the nodes come back up online the bricks are typically back up and working, but some (random) glustershd processes in the various nodes seem to have issues connecting to some of them.
>>>>>>>>>>>
>>>>>>>>>>> When the issue happens, can you check if the TCP port number of the brick (glusterfsd) processes displayed in `gluster volume status` matches the actual port numbers observed (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd, instead of killing the bricks + `volume start force`, should fix it, although we need to find out why glusterd serves incorrect port numbers.
>>>>>>>>>>>
>>>>>>>>>>> If they do match, then can you take a statedump of glustershd to check that it is indeed disconnected from the bricks? You will need to verify that 'connected=1' in the statedump. See the "Self-heal is stuck/not getting completed" section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. The statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`; it will be generated in the /var/run/gluster/ directory.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ravi
>>>
>>> --
>>> Regards,
>>> Srijan
>
> --
> Regards,
> Srijan
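[Editor's note: the statedump check described in this message can be scripted. A hedged sketch follows -- the trigger commands are shown as comments because they require a live glustershd process, and the `glusterdump.*` file name and pgrep pattern are common defaults that should be verified on your own system; `disconnected_clients` is an illustrative helper.]

```shell
#!/bin/sh
# On a live node, one would first trigger the statedump, e.g.:
#
#   kill -SIGUSR1 "$(pgrep -f glustershd | head -n 1)"
#   sleep 2   # give the process a moment to write the dump
#   dump=$(ls -t /var/run/gluster/glusterdump.* | head -n 1)
#
# and then list the client xlators the dump reports as disconnected:

disconnected_clients() {
    # Remember the most recent client-xlator section header and print it
    # whenever the section contains "connected=0".
    awk '/^\[xlator\.protocol\.client\./ {sect = $0}
         /^connected=0/ {print sect}' "$1"
}
```

Applied to the statedump excerpts in this thread, this would print the `VM_Storage_1-client-11` and `VM_Storage_1-client-20` section headers, matching what Marco found by eye.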
Ville-Pekka Vainio
2021-Nov-05 12:27 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Hi!

Bumping an old thread, because there's now activity around this bug. The github issue is https://github.com/gluster/glusterfs/issues/2492

We just hit this bug after an update from GlusterFS 7.x to 9.4. We did not see this in our test environment, so we did the update, but the bug is still there. Apparently the fix should be https://github.com/gluster/glusterfs/pull/2509 which should get backported to 9.x.

We worked around this issue by identifying the server with the bug and restarting the GlusterFS processes on it.

On an EL/CentOS/Fedora-based system there was one small thing that surprised me; maybe this will help others. There's the service /usr/lib/systemd/system/glusterfsd.service, which does not really start anything (it just runs /bin/true), but when stopped it will kill the brick processes on the server. If you try "systemctl stop glusterfsd" but you have not started the service (even though starting it does nothing), systemd will not do anything. If you first start the service and then stop it, systemd will actually run the ExecStop command.

Best regards,
Ville-Pekka
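[Editor's note: the systemd quirk described above can be captured in a small helper. This is a hedged sketch, not part of any Gluster packaging; `stop_commands_for` is an illustrative function, and the premise -- that systemd runs ExecStop only for units it considers active -- is Ville-Pekka's observation.]

```shell
#!/bin/sh
# glusterfsd.service starts nothing real (/bin/true), so only its
# ExecStop does actual work: killing the brick processes. systemd runs
# ExecStop only for a unit it considers active, hence the start-then-stop
# dance. Given the unit's current ActiveState (e.g. from
# `systemctl show -p ActiveState glusterfsd`), print the commands needed.

stop_commands_for() {
    if [ "$1" = "active" ]; then
        echo "systemctl stop glusterfsd"
    else
        # Start first (a no-op) so that systemd will honour the stop.
        echo "systemctl start glusterfsd; systemctl stop glusterfsd"
    fi
}

stop_commands_for inactive
```

In other words, on a node where the unit was never started, run the start (harmless) before the stop, or the bricks will keep running.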