Jo Goossens
2017-Sep-21 21:07 UTC
[Gluster-users] After stop and start wrong port is advertised
Hi,

We use glusterfs 3.10.5 on Debian 9.

When we stop or restart the service, e.g.: service glusterfs-server restart

We see that the wrong port gets advertised afterwards. For example:

Before restart:

Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       5932
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

After restart of the service on one of the nodes (192.168.140.43), the port seems to have changed (but it didn't):

root@app3:/var/log/glusterfs# gluster volume status
Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       4628
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

However, the active process is STILL the same pid AND still listening on the old port:

root@192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd

The other nodes' logs fill up with errors because they can't reach the daemon anymore.
They try to reach it on the "new" port instead of the old one:

[2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket

So they now try 49154 instead of the old 49152.

Is this also by design? We had a lot of issues because of this recently. We don't understand why it starts advertising a completely wrong port after stop/start.

Regards
Jo Goossens
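A minimal shell sketch for confirming the mismatch described above, run on the affected node. The volume name public, brick path /gluster/public, and node 192.168.140.43 are taken from the report; the workaround at the end is a hedged suggestion, not something confirmed in this thread:

#!/bin/sh
# Run on the affected node (192.168.140.43 in the report above).

# Port glusterd advertises for the local brick (what clients are told):
gluster volume status public | grep '192.168.140.43:/gluster/public'

# Port the brick process is actually bound to:
netstat -tlnp | grep glusterfsd

# If the two ports differ, clients are sent to a port nothing listens
# on, producing the "Connection refused" loop shown in the logs above.

# Hedged workaround (commonly suggested for stale portmap entries, not
# confirmed here): restart the brick so it re-registers its port.
#   kill <pid of the stale glusterfsd>
#   gluster volume start public force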
Atin Mukherjee
2017-Sep-22 08:36 UTC
[Gluster-users] After stop and start wrong port is advertised
On Fri, Sep 22, 2017 at 2:37 AM, Jo Goossens <jo.goossens@hosted-power.com> wrote:

> [...]
>
> Is this also by design? We had a lot of issues because of this recently.
> We don't understand why it starts advertising a completely wrong port
> after stop/start.

This looks like a bug to me. For some reason glusterd's portmap is referring to a stale port (IMO), whereas the brick is still listening on the correct port. But ideally, when the glusterd service is restarted, the whole in-memory portmap is rebuilt.

I'd request the following details from you so we can start analysing it:

1. glusterd statedump output from 192.168.140.43. You can use kill -SIGUSR2 <pid of glusterd> to request a statedump; the file will be available in /var/run/gluster.
2. glusterd and brick logfiles for 192.168.140.43:/gluster/public from 192.168.140.43.
3. cmd_history logfile from all the nodes.
4. Content of /var/lib/glusterd/vols/public/
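A minimal shell sketch of collecting the four items requested above, run on 192.168.140.43. The log file names are assumptions based on default Debian packaging (on 3.10 the glusterd log may be named etc-glusterfs-glusterd.vol.log rather than glusterd.log, and the brick log name is the brick path with '/' replaced by '-'), so adjust paths to your install:

#!/bin/sh
# Gather the requested diagnostics into one tarball.
OUT=/tmp/gluster-diag
mkdir -p "$OUT"

# 1. glusterd statedump: SIGUSR2 makes glusterd write a dump under /var/run/gluster
kill -SIGUSR2 "$(pidof glusterd)"
sleep 2
cp /var/run/gluster/glusterdump.* "$OUT"/

# 2. glusterd and brick logs (either glusterd log name may exist, depending on version)
cp /var/log/glusterfs/glusterd.log "$OUT"/ 2>/dev/null
cp /var/log/glusterfs/etc-glusterfs-glusterd.vol.log "$OUT"/ 2>/dev/null
cp /var/log/glusterfs/bricks/gluster-public.log "$OUT"/

# 3. command history (repeat this step on every node)
cp /var/log/glusterfs/cmd_history.log "$OUT"/

# 4. the volume's on-disk configuration as glusterd sees it
tar -C /var/lib/glusterd/vols -czf "$OUT"/vols-public.tar.gz public

tar -C /tmp -czf /tmp/gluster-diag.tar.gz gluster-diag
echo "diagnostics collected in /tmp/gluster-diag.tar.gz"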