Ernie Dunbar
2016-Apr-07 16:02 UTC
[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.
On 2016-04-06 21:20, Atin Mukherjee wrote:
> On 04/07/2016 04:04 AM, Ernie Dunbar wrote:
>> On 2016-04-06 11:42, Ernie Dunbar wrote:
>>> I've already successfully created a Gluster cluster, but when I try to
>>> add a new node, gluster on the new node claims it can't find the
>>> hostname of the first node in the cluster.
>>>
>>> I've added the hostname nfs1.lightspeed.ca to /etc/hosts like this:
>>>
>>> root@nfs3:/home/ernied# cat /etc/hosts
>>> 127.0.0.1 localhost
>>> 192.168.1.31 nfs1.lightspeed.ca nfs1
>>> 192.168.1.32 nfs2.lightspeed.ca nfs2
>>> 127.0.1.1 nfs3.lightspeed.ca nfs3
>>>
>>> # The following lines are desirable for IPv6 capable hosts
>>> ::1 localhost ip6-localhost ip6-loopback
>>> ff02::1 ip6-allnodes
>>> ff02::2 ip6-allrouters
>>>
>>> I can ping the hostname:
>>>
>>> root@nfs3:/home/ernied# ping -c 3 nfs1
>>> PING nfs1.lightspeed.ca (192.168.1.31) 56(84) bytes of data.
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=1 ttl=64 time=0.148 ms
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=2 ttl=64 time=0.126 ms
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=3 ttl=64 time=0.133 ms
>>>
>>> --- nfs1.lightspeed.ca ping statistics ---
>>> 3 packets transmitted, 3 received, 0% packet loss, time 1998ms
>>> rtt min/avg/max/mdev = 0.126/0.135/0.148/0.016 ms
>>>
>>> I can get gluster to probe the hostname:
>>>
>>> root@nfs3:/home/ernied# gluster peer probe nfs1
>>> peer probe: success. Host nfs1 port 24007 already in peer list
>>>
>>> But if I try to create the brick on the new node, it says that the
>>> host can't be found? Um...
>>>
>>> root@nfs3:/home/ernied# gluster volume create gv2 replica 3
>>> nfs1.lightspeed.ca:/brick1/gv2/ nfs2.lightspeed.ca:/brick1/gv2/
>>> nfs3.lightspeed.ca:/brick1/gv2
>>> volume create: gv2: failed: Failed to find host nfs1.lightspeed.ca
>>>
>>> Our logs from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
>>>
>>> [2016-04-06 18:19:18.107459] E [MSGID: 106452]
>>> [glusterd-utils.c:5825:glusterd_new_brick_validate] 0-management:
>>> Failed to find host nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.107496] E [MSGID: 106536]
>>> [glusterd-volume-ops.c:1364:glusterd_op_stage_create_volume]
>>> 0-management: Failed to find host nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.107516] E [MSGID: 106301]
>>> [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of
>>> operation 'Volume Create' failed on localhost : Failed to find host
>>> nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.231864] E [MSGID: 106170]
>>> [glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
>>> Request from peer 192.168.1.31:65530 has an entry in peerinfo, but
>>> uuid does not match
>
> We have introduced a new check to reject a peer if the request is coming
> from a node where the hostname matches but UUID is different. This can
> happen if a node goes through a re-installation and its
> /var/lib/glusterd/* content is wiped off. Look at [1] for more details.
>
> [1] http://review.gluster.org/13519
>
> Do confirm if that's the case.

I couldn't say if that's *exactly* the case, but it's pretty close. I
don't recall ever removing /var/lib/glusterd/* or any of its contents,
but the operating system isn't exactly the way it was when I first tried
to add this node to the cluster.

What should I do to *fix* the problem though, so I can add this node to
the cluster? This bug report doesn't appear to provide a solution. I've
tried removing the node from the cluster, and that failed too.
Things seem to be in a very screwy state right now.

>>> [2016-04-06 18:19:18.231919] E [MSGID: 106170]
>>> [glusterd-handshake.c:1060:gd_validate_mgmt_hndsk_req] 0-management:
>>> Rejecting management handshake request from unknown peer
>>> 192.168.1.31:65530
>>>
>>> That error about the entry in peerinfo doesn't match anything in
>>> Google besides the source code for Gluster. My guess is that my
>>> earlier unsuccessful attempts to add this node before v3.7.10 have
>>> created a conflict that needs to be cleared.
>>
>> More interesting is what happens when I try to add the third server to
>> the brick from the first gluster server:
>>
>> root@nfs1:/home/ernied# gluster volume add-brick gv2 replica 3
>> nfs3:/brick1/gv2
>> volume add-brick: failed: One or more nodes do not support the required
>> op-version. Cluster op-version must atleast be 30600.
>>
>> Yet, when I view the operating version in /var/lib/glusterd/glusterd.info:
>>
>> root@nfs1:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=1207917a-23bc-4bae-8238-cd691b7082c7
>> operating-version=30501
>>
>> root@nfs2:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=e394fcec-41da-482a-9b30-089f717c5c06
>> operating-version=30501
>>
>> root@nfs3:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=ae191e96-9cd6-4e2b-acae-18f2cc45e6ed
>> operating-version=30501
>>
>> I see that the operating version is the same on all nodes!
>
> Here the cluster op-version is pretty old. You need to make sure that
> you bump up the op-version with 'gluster volume set all
> cluster.op-version 30710'. The add-brick code path has a check that your
> cluster op-version has to be at least 30600 if you are on a gluster
> version >= 3.6, which is the case here.
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
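Atin's op-version advice above can be written out as shell commands. This is only a sketch based on the thread: the glusterd.info path and the `gluster volume set` invocation are the ones shown in the messages, and 30710 matches the gluster 3.7.10 release being discussed; adjust the target for your own version.

```shell
# On each node (nfs1, nfs2, nfs3), check the op-version glusterd has
# recorded on disk; the thread shows all three nodes reporting 30501:
grep '^operating-version' /var/lib/glusterd/glusterd.info

# From any one node, raise the cluster-wide op-version. add-brick
# refuses to run while this value is below 30600:
gluster volume set all cluster.op-version 30710
```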
Atin Mukherjee
2016-Apr-07 16:16 UTC
[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.
-Atin
Sent from one plus one

On 07-Apr-2016 9:32 pm, "Ernie Dunbar" <maillist at lightspeed.ca> wrote:
>
> [...]
> I couldn't say if that's *exactly* the case, but it's pretty close. I
> don't recall ever removing /var/lib/glusterd/* or any of its contents,
> but the operating system isn't exactly the way it was when I first
> tried to add this node to the cluster.
>
> What should I do to *fix* the problem though, so I can add this node to
> the cluster? This bug report doesn't appear to provide a solution. I've
> tried removing the node from the cluster, and that failed too. Things
> seem to be in a very screwy state right now.
I should have given the workaround earlier. Find the peer file for the
faulty node in /var/lib/glusterd/peers/ and delete it on every node
except the faulty one, then restart the glusterd instance on all of
those nodes. On the faulty node itself, make sure the /var/lib/glusterd/
content is empty and restart glusterd, then peer probe this node from
any node in the existing cluster. You should also bump up the op-version
once the cluster is stable.
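A concrete sketch of these steps follows. It is hypothetical, not from the thread: `<stale-uuid>` stands in for the file name of the faulty node's old peer entry (peer files under /var/lib/glusterd/peers/ are named by UUID and are assumed to record the peer's hostname, as they do in gluster 3.x), and the glusterd service name varies by distribution.

```shell
# On every node EXCEPT the faulty one (nfs1 and nfs2 in this thread):
grep -l 'nfs3.lightspeed.ca' /var/lib/glusterd/peers/*  # locate the stale entry
rm /var/lib/glusterd/peers/<stale-uuid>                 # the file found above
service glusterd restart   # service name is glusterfs-server on Debian/Ubuntu

# On the faulty node (nfs3), wipe glusterd's state and restart.
# WARNING: this removes the node's old UUID and all peer/volume metadata.
rm -rf /var/lib/glusterd/*
service glusterd restart

# From any node already in the cluster, probe the faulty node again:
gluster peer probe nfs3.lightspeed.ca

# Once the cluster is healthy, bump the op-version as suggested in the thread:
gluster volume set all cluster.op-version 30710
```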
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users