Ravishankar N
2021-May-25 16:27 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Hi Marco,

I haven't had any luck yet. Adding Mohit and Srijan who work in glusterd in case they have some inputs.
-Ravi

On Tue, May 25, 2021 at 9:31 PM Marco Fais <evilmf at gmail.com> wrote:

> Hi Ravi
>
> just wondering if you have any further thoughts on this -- unfortunately it is something still very much affecting us at the moment.
> I am trying to understand how to troubleshoot it further but haven't been able to make much progress...
>
> Thanks,
> Marco
>
> On Thu, 20 May 2021 at 19:04, Marco Fais <evilmf at gmail.com> wrote:
>
>> Just to complete...
>>
>> From the FUSE mount log on server 2 I see the same errors as in glustershd.log on node 1:
>>
>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are ready, attempting connect on transport []
>> [2021-05-20 17:58:34.160586 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port to 49170 (from 0)
>> [2021-05-20 17:58:34.160608 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional socket shutdown(20)
>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10: Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10}, {remote_subvol=/bricks/vm_b3_vol/brick}]
>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is met
>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are ready, attempting connect on transport []
>> [2021-05-20 17:58:34.162491 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port to 49170 (from 0)
>> [2021-05-20 17:58:34.162507 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional socket shutdown(26)
>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>
>> On Thu, 20 May 2021 at 18:54, Marco Fais <evilmf at gmail.com> wrote:
>>
>>> Hi Ravi,
>>>
>>> thanks again for your help.
>>> Here is the output of "cat graphs/active/VM_Storage_1-client-11/private" from the same node where glustershd is complaining:
>>>
>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>> fd.0.remote_fd = 1
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.1.remote_fd = 0
>>> ------ = ------
>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.2.remote_fd = 3
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> fd.3.remote_fd = 2
>>> ------ = ------
>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>> ------ = ------
>>> connected = 1
>>> total_bytes_read = 6665235356
>>> ping_timeout = 42
>>> total_bytes_written = 4756303549
>>> ping_msgs_sent = 3662
>>> msgs_sent = 16786186
>>>
>>> So they seem to be connected there.
>>> *However* -- they are apparently not connected on server 2 (where I have just re-mounted the volume):
>>>
>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-11/private
>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>> *connected = 0*
>>> total_bytes_read = 50020
>>> ping_timeout = 42
>>> total_bytes_written = 84628
>>> ping_msgs_sent = 0
>>> msgs_sent = 0
>>> [root at lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-20/private
>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>> *connected = 0*
>>> total_bytes_read = 53300
>>> ping_timeout = 42
>>> total_bytes_written = 90180
>>> ping_msgs_sent = 0
>>> msgs_sent = 0
>>>
>>> The other bricks look connected...
>>>
>>> Regards,
>>> Marco
>>>
>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishankar at redhat.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>
>>>>> Hi Ravi,
>>>>>
>>>>> thanks a million for your reply.
>>>>>
>>>>> I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.
>>>>> The glustershd process in the restarted node is now complaining about connectivity to two bricks in one of my volumes:
>>>>>
>>>>> ---
>>>>> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>>>>>
>>>>
>>>> The above logs indicate that shd is trying to connect to the bricks on ports 49170 and 49173 respectively, when it should have done so using 49172 and 49169 (as per the volume status and ps output). Shd gets the brick port numbers from glusterd, so I'm not sure what is going on here. Do you have fuse mounts on this particular node?
>>>> If you don't, you can mount it temporarily, then check if the connection to the bricks is successful from the .meta folder of the mount:
>>>>
>>>> cd /path-to-fuse-mount
>>>> cd .meta
>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>
>>>> etc., and check if connected=1 or 0.
>>>>
>>>> I just wanted to see whether it is only the shd, or whether the other clients are also unable to connect to the bricks from this node. FWIW, I tried upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect to the bricks just fine.
>>>>
>>>> Regards,
>>>> Ravi
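The same check can be run over every client xlator of the mount in one go. This is only a minimal sketch based on the .meta layout shown above; it assumes the volume is FUSE-mounted at the hypothetical path /mnt/VM_Storage_1, so adjust the mount point to your setup:

  #!/bin/bash
  # Sketch: print the "connected" state of every client xlator of a FUSE mount,
  # using the virtual .meta directory of the mount.
  MOUNT=/mnt/VM_Storage_1              # hypothetical mount point -- adjust
  cd "$MOUNT/.meta/graphs/active" || exit 1
  for xl in *-client-*; do
      # each client xlator exposes its runtime state in a "private" file
      printf '%s: %s\n' "$xl" "$(grep -E '^connected' "$xl/private")"
  done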
>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event []
>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}]
>>>>> ---
>>>>>
>>>>> The two bricks are the following:
>>>>> VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>> VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>> (In this case the issue is on two arbiter bricks, but that is not always the case.)
>>>>>
>>>>> The port information via "gluster volume status VM_Storage_1" on the affected node (same as the one running the glustershd reporting the issue) is:
>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick    *49172*   0   Y   3978256
>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick    *49169*   0   Y   3978224
>>>>>
>>>>> This is aligned with the actual ports of the processes:
>>>>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49172* --xlator-option VM_Storage_1-server.listen-port=*49172*
>>>>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49169* --xlator-option VM_Storage_1-server.listen-port=*49169*
>>>>>
>>>>> So the issue seems to be specific to glustershd, as the *glusterd process seems to be aware of the right port* (it matches the real port, and the brick is indeed up according to the status).
>>>>>
>>>>> I have then requested a statedump as you suggested, and the bricks appear to be not connected:
>>>>>
>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>> *connected=0*
>>>>> total_bytes_read=341120
>>>>> ping_timeout=42
>>>>> total_bytes_written=594008
>>>>> ping_msgs_sent=0
>>>>> msgs_sent=0
>>>>>
>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>> *connected=0*
>>>>> total_bytes_read=341120
>>>>> ping_timeout=42
>>>>> total_bytes_written=594008
>>>>> ping_msgs_sent=0
>>>>> msgs_sent=0
>>>>>
>>>>> The other important thing to note is that the bricks that fail to connect are normally always on the same (remote) node -- i.e. they are both on node 3 in this case. That seems to be always the case; I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).
>>>>>
>>>>> Please let me know if you need any additional info.
>>>>>
>>>>> Regards,
>>>>> Marco
>>>>>
>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>>
>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>>>>>>
>>>>>>> My oVirt clusters are using gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>>>>>>>
>>>>>>> Since then I am having issues whenever one of the nodes is brought down; when the nodes come back online the bricks are typically back up and working, but some (random) glustershd processes in the various nodes seem to have issues connecting to some of them.
>>>>>>>
>>>>>>
>>>>>> When the issue happens, can you check if the TCP port number of the brick (glusterfsd) processes displayed in `gluster volume status` matches the actual port numbers observed (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd instead of (killing the bricks + `volume start force`) should fix it, although we need to find out why glusterd serves incorrect port numbers.
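The comparison suggested here can be scripted roughly as follows. This is a sketch only: it uses the volume name from this thread, and the /var/lib/glusterd brickinfo path is assumed to be where glusterd persists its view of the listen port, so verify both on your own installation.

  #!/bin/bash
  # Sketch: compare the brick ports advertised by glusterd with the ports the
  # local glusterfsd processes were actually started with.
  VOL=VM_Storage_1                     # volume name from this thread -- adjust

  echo "== ports according to 'gluster volume status' =="
  gluster volume status "$VOL" | grep '^Brick'

  echo "== ports according to the running brick processes =="
  ps aux | grep '[g]lusterfsd' | grep -o -e '--brick-port [0-9]*'

  echo "== ports persisted by glusterd on disk (assumed layout) =="
  grep -H '^listen-port' /var/lib/glusterd/vols/"$VOL"/bricks/* 2>/dev/null

  # If the numbers disagree, the suggestion above is to restart glusterd
  # (rather than killing the bricks + 'volume start force'):
  #   systemctl restart glusterd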
>>>>>> If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' in the statedump. See the "Self-heal is stuck / not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`; it will be generated in the /var/run/gluster/ directory.
>>>>>>
>>>>>> Regards,
>>>>>> Ravi
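The statedump check described above can be wrapped up roughly like this. A sketch, not a polished tool: it assumes a single self-heal daemon whose command line contains the string "glustershd", and a dump file named glusterdump.<pid>.dump.<timestamp> under /var/run/gluster/, both of which are worth double-checking on your version.

  #!/bin/bash
  # Sketch: trigger a statedump of glustershd and list the per-brick
  # connection flags it contains.
  SHD_PID=$(pgrep -f glustershd | head -1)   # assumes one shd process per node
  [ -n "$SHD_PID" ] || { echo "glustershd not running"; exit 1; }

  kill -SIGUSR1 "$SHD_PID"
  sleep 2                                    # give it a moment to write the dump

  DUMP=$(ls -t /var/run/gluster/glusterdump."$SHD_PID".dump.* 2>/dev/null | head -1)
  [ -n "$DUMP" ] || { echo "no statedump found in /var/run/gluster/"; exit 1; }

  # Print each client xlator section header followed by its connected flag,
  # e.g. [xlator.protocol.client.VM_Storage_1-client-11.priv] / connected=0
  grep -E '^\[xlator\.protocol\.client\..*\.priv\]|^connected=' "$DUMP"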
Marco Fais
2021-May-26 11:32 UTC
[Gluster-users] Issues with glustershd with release 8.4 and 9.1
Ravi,

thanks a million. @Mohit, @Srijan, please let me know if you need any additional information.

Thanks,
Marco

On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishankar at redhat.com> wrote:

> Hi Marco,
> I haven't had any luck yet. Adding Mohit and Srijan who work in glusterd in case they have some inputs.
> -Ravi