Jo Goossens
2017-Sep-21 08:38 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
Hi,

We use glusterfs 3.10.5 on Debian 9.

When we stop or restart the service, e.g.: service glusterfs-server restart

We see that the wrong port gets advertised afterwards. For example:

Before restart:

Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       5932
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

After restart of the service on one of the nodes (192.168.140.43) the port seems to have changed (but it didn't):

root@app3:/var/log/glusterfs# gluster volume status
Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       4628
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

However the active process is STILL the same pid AND still listening on the old port:

root@192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd

The other nodes' logs fill up with errors because they can't reach the daemon anymore.
They try to reach it on the "new" port instead of the old one:

[2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket

So they now try 49154 instead of the old 49152.

Is this also by design? We had a lot of issues because of this recently. We don't understand why it starts advertising a completely wrong port after stop/start.

Regards
Jo Goossens
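For anyone comparing the two views on an affected node, this is a minimal sketch of the checks used above plus a look at glusterd's on-disk record of the brick port. The brickinfo path and the listen-port key below follow common glusterd defaults and may differ per version; the volume name and brick are taken from the output above:

    # Port glusterd advertises for the local brick
    gluster volume status public | grep 192.168.140.43

    # Port the brick process is actually bound to
    netstat -tlnp | grep glusterfsd        # or: ss -tlnp | grep glusterfsd

    # Port recorded in glusterd's persistent brickinfo (assumed path and key)
    grep listen-port /var/lib/glusterd/vols/public/bricks/192.168.140.43\:-gluster-public

If the first and third outputs disagree with the second, the portmap/brickinfo data is stale, which matches the symptom described in this thread.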
Atin Mukherjee
2017-Sep-22 13:10 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
I've already replied to your earlier email. In case you've not seen it in your mailbox, here it goes:

This looks like a bug to me. For some reason glusterd's portmap is referring to a stale port (IMO) whereas the brick is still listening on the correct port. But ideally, when the glusterd service is restarted, the entire in-memory portmap is rebuilt. I'd request the following details from you so we can start analysing it:

1. glusterd statedump output from 192.168.140.43. You can use kill -SIGUSR2 <pid of glusterd> to request a statedump; the file will be available in /var/run/gluster.
2. glusterd and brick logfiles for 192.168.140.43:/gluster/public from 192.168.140.43.
3. cmd_history logfile from all the nodes.
4. Content of /var/lib/glusterd/vols/public/

(A rough sketch of how these could be collected follows the quoted message below.)

On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens@hosted-power.com> wrote:

> Hi,
>
> We use glusterfs 3.10.5 on Debian 9.
>
> When we stop or restart the service, e.g.: service glusterfs-server restart
>
> We see that the wrong port gets advertised afterwards. [...]
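Gathering items 1-4 on 192.168.140.43 could look roughly like this. This is a sketch only: exact log file names under /var/log/glusterfs and the statedump file name vary by version and distro, the tarball name is just an example, and cmd_history.log has to be collected from every node, not only this one:

    # 1. Ask glusterd for a statedump (written to /var/run/gluster)
    kill -SIGUSR2 $(pidof glusterd)

    # 2.-4. Bundle logs, command history and volume metadata from this node
    tar czf /tmp/gluster-portmap-debug.tgz \
        /var/run/gluster/*dump* \
        /var/log/glusterfs/*glusterd*.log \
        /var/log/glusterfs/bricks/gluster-public.log \
        /var/log/glusterfs/cmd_history.log \
        /var/lib/glusterd/vols/public/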
Darrell Budic
2017-Sep-22 15:24 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
I encountered this once in the past. An additional symptom was that peers showed up as disconnected on the nodes that were NOT using the wrong ports; the disconnected peers are how I detected it in the first place. It happened to me after rebooting, and I fixed it, but I wasn't able to stop and gather debugging info at the time.

The problem seemed to be that the volume files in /var/lib/glusterd/vols/<vol-name>/bricks/<server name>\:-v0-<vol name>-brick0 were not updated to reflect the new port numbers after the restart (and the port numbers had changed due to adding and deleting volumes since the last start). I stopped glusterd, killed any remaining glusterfsd processes, hand-edited those files to reflect the ports the bricks were actually running on (from vol info I think, maybe the log files), and restarted glusterd; then everything was happy again. (A rough outline of that workaround follows the quoted message below.)

Hope it helps. Sounds like it may be a bug to me too if others are seeing it.

  -Darrell

> On Sep 22, 2017, at 8:10 AM, Atin Mukherjee <amukherj@redhat.com> wrote:
>
> I've already replied to your earlier email. In case you've not seen it in your mailbox, here it goes:
>
> This looks like a bug to me. For some reason glusterd's portmap is referring to a stale port (IMO) whereas the brick is still listening on the correct port. [...]
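Based on that description, the manual workaround might look roughly like the following on an affected node. This is a sketch only, not a recommendation: the volume name, brick path and the listen-port key are assumptions drawn from this thread, and files under /var/lib/glusterd should only be edited with glusterd stopped and a backup in place:

    service glusterfs-server stop      # stop glusterd (systemctl stop glusterd on other distros)
    pkill glusterfsd                   # kill any brick processes still running

    # Back up glusterd's volume metadata before touching it
    cp -a /var/lib/glusterd/vols/public /root/public.vols.bak

    # Edit the brickinfo file so listen-port matches the port the brick really used
    # (check the earlier netstat output or the brick log for the real port)
    vi /var/lib/glusterd/vols/public/bricks/192.168.140.43\:-gluster-public

    service glusterfs-server start     # glusterd re-reads the files and respawns the bricks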
Atin Mukherjee
2017-Oct-27 19:01 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
We (finally) figured out the root cause, Jo!

Patch https://review.gluster.org/#/c/18579 posted upstream for review.

On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens@hosted-power.com> wrote:

> Hi,
>
> We use glusterfs 3.10.5 on Debian 9.
>
> When we stop or restart the service, e.g.: service glusterfs-server restart
>
> We see that the wrong port gets advertised afterwards. [...]