Atin Mukherjee
2017-Oct-27  19:01 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
We (finally) figured out the root cause, Jo! Patch
https://review.gluster.org/#/c/18579 posted upstream for review.

On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens at hosted-power.com> wrote:

> Hi,
>
> We use glusterfs 3.10.5 on Debian 9.
>
> When we stop or restart the service, e.g.: service glusterfs-server restart
>
> We see that the wrong port gets advertised afterwards. For example:
>
> Before restart:
>
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       5932
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
>
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> After restart of the service on one of the nodes (192.168.140.43) the port
> seems to have changed (but it didn't):
>
> root at app3:/var/log/glusterfs# gluster volume status
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       4628
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
>
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> However the active process is STILL the same pid AND still listening on
> the old port:
>
> root at 192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
> tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
>
> The other nodes' logs fill up with errors because they can't reach the
> daemon anymore. They try to reach it on the "new" port instead of the old
> one:
>
> [2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
>
> So they now try 49154 instead of the old 49152.
>
> Is this also by design? We had a lot of issues because of this recently.
> We don't understand why it starts advertising a completely wrong port
> after stop/start.
>
> Regards
> Jo Goossens
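For anyone hitting the same symptom before the fix is released, the mismatch is easy to confirm by comparing the port glusterd advertises with the port glusterfsd actually listens on. A minimal sketch, assuming a single local brick per volume and bricks defined by IP, as in Jo's output above (the volume name and IP are taken from his example):

#!/bin/bash
# Compare the brick port advertised by 'gluster volume status' with the
# port the local glusterfsd process is really listening on.
VOL=public              # volume name from the report above
IP=192.168.140.43       # this node's IP as it appears in the brick path

# Port column ($3) of the 'Brick <ip>:<path>' line for this node.
advertised=$(gluster volume status "$VOL" | awk -v ip="$IP" \
    '$1 == "Brick" && $2 ~ ip { print $3 }')

# Port(s) glusterfsd actually has in LISTEN state (needs root for -p).
listening=$(netstat -tlpn 2>/dev/null | awk '/glusterfsd/ {
    n = split($4, a, ":"); print a[n] }' | sort -u)

echo "advertised: ${advertised}   listening: ${listening}"
if [ "${advertised}" != "${listening}" ]; then
    echo "!!! mismatch: clients are told ${advertised}, brick serves ${listening}"
fi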
Jo Goossens
2017-Oct-27  21:06 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
Hello Atin,

I just read it and am very happy you found the issue. We really hope this
will be fixed in the next 3.10.7 version!

PS: Wow, nice, all that C code and those "goto out" statements (not always
considered clean, but often the best way, I think). I can remember the days
when I wrote kernel drivers in C myself :)

Regards
Jo Goossens

-----Original message-----
From: Atin Mukherjee <amukherj at redhat.com>
Sent: Fri 27-10-2017 21:01
Subject: Re: [Gluster-users] BUG: After stop and start wrong port is advertised
To: Jo Goossens <jo.goossens at hosted-power.com>
CC: gluster-users at gluster.org

> We (finally) figured out the root cause, Jo! Patch
> https://review.gluster.org/#/c/18579 posted upstream for review.
Atin Mukherjee
2017-Oct-30  16:40 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
On Sat, 28 Oct 2017 at 02:36, Jo Goossens <jo.goossens at hosted-power.com> wrote:

> Hello Atin,
>
> I just read it and am very happy you found the issue. We really hope this
> will be fixed in the next 3.10.7 version!

3.10.7 - no, I'm afraid, as the patch is still in review and 3.10.7 is
getting tagged today. You'll get this fix in 3.10.8.

---
Atin (atinm)
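Until 3.10.8 lands, an interim recovery that is often suggested for stale brick ports is to kill the stale brick process and force-start the volume; this is an assumption on how to recover, not a step confirmed anywhere in this thread:

# On the affected node, as root (pid and volume name from the example above).
# Killing the brick and force-starting the volume makes glusterd spawn a
# fresh glusterfsd, which then listens on the port glusterd advertises.
# Disruptive, and its effect on this exact stale-port case is assumed here.
kill 5913                          # the stale glusterfsd from the report
gluster volume start public force  # respawn the brick

# Verify that the advertised and listening ports agree again:
gluster volume status public
netstat -tlpn | grep glusterfsd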
Mike Hulsman
2017-Nov-08  11:28 UTC
[Gluster-users] BUG: After stop and start wrong port is advertised
Hi, 
This bug is hitting me hard at two different clients, on RHGS 3.3 and on
glusterfs 3.10.2 on CentOS 7.4. In one case I had 59 mismatches out of a
total of 203 bricks.
I wrote a quick-and-dirty script that checks, for every local brick, the
port recorded in the brick file against the port used by the running
process:
#!/bin/bash
# Compare the listen-port recorded in each local brick file under
# /var/lib/glusterd with the listen-port of the matching running
# glusterfsd process.

Host=$(uname -n | awk -F"." '{print $1}')

# Volume names, derived from the last argument of each running glusterfsd,
# which looks like "<volume>-server.listen-port=<port>".
GlusterVols=$(ps -eaf | grep /usr/sbin/glusterfsd | grep -v grep \
    | awk '{print $NF}' | awk -F"-server" '{print $1}' | sort -u)

for Volume in ${GlusterVols}; do
    cd /var/lib/glusterd/vols/${Volume}/bricks || continue
    Bricks=$(ls ${Host}*)
    for Brick in ${Bricks}; do
        # Port recorded in the brick file ("listen-port=NNNNN").
        Onfile=$(grep ^listen-port "${Brick}")
        # Brick directory, recovered from the brick file name.
        BrickDir=$(echo "${Brick}" | awk -F":" '{print $2}' | cut -c2-)
        # The same "listen-port=NNNNN" string, taken from the running process.
        Daemon=$(ps -eaf | grep "${BrickDir}.pid" | grep -v grep \
            | awk '{print $NF}' | awk -F"." '{print $2}')
        if [ "${Onfile}" = "${Daemon}" ]; then
            echo "OK For ${Brick}"
        else
            echo "!!! NOT OK For ${Brick}"
        fi
    done
done
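On a healthy node every line should read OK; a brick hit by the bug shows up as NOT OK. Illustrative output (the script name, hostname, and brick-file names are invented for the example):

root at app3:~# ./check-brick-ports.sh
OK For app3:-gluster-public
!!! NOT OK For app3:-gluster-data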
Kind regards,
Mike Hulsman 
Proxy Managed Services B.V. | www.proxy.nl | Enterprise IT-Infra, Open Source
and Cloud Technology
Delftweg 128 3043 NB Rotterdam The Netherlands | +31 10 307 0965 
From: "Jo Goossens" <jo.goossens at hosted-power.com> 
To: "Atin Mukherjee" <amukherj at redhat.com> 
Cc: gluster-users at gluster.org 
Sent: Friday, October 27, 2017 11:06:35 PM 
Subject: Re: [Gluster-users] BUG: After stop and start wrong port is advertised 
> Hello Atin,
>
> I just read it and am very happy you found the issue. We really hope this
> will be fixed in the next 3.10.7 version!
>
> PS: Wow, nice, all that C code and those "goto out" statements (not always
> considered clean, but often the best way, I think). I can remember the days
> when I wrote kernel drivers in C myself :)
>
> Regards
> Jo Goossens