Ernie Dunbar
2016-Apr-07 16:02 UTC
[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.
On 2016-04-06 21:20, Atin Mukherjee wrote:
> On 04/07/2016 04:04 AM, Ernie Dunbar wrote:
>> On 2016-04-06 11:42, Ernie Dunbar wrote:
>>> I've already successfully created a Gluster cluster, but when I try to
>>> add a new node, gluster on the new node claims it can't find the
>>> hostname of the first node in the cluster.
>>>
>>> I've added the hostname nfs1.lightspeed.ca to /etc/hosts like this:
>>>
>>> root@nfs3:/home/ernied# cat /etc/hosts
>>> 127.0.0.1 localhost
>>> 192.168.1.31 nfs1.lightspeed.ca nfs1
>>> 192.168.1.32 nfs2.lightspeed.ca nfs2
>>> 127.0.1.1 nfs3.lightspeed.ca nfs3
>>>
>>> # The following lines are desirable for IPv6 capable hosts
>>> ::1 localhost ip6-localhost ip6-loopback
>>> ff02::1 ip6-allnodes
>>> ff02::2 ip6-allrouters
>>>
>>> I can ping the hostname:
>>>
>>> root@nfs3:/home/ernied# ping -c 3 nfs1
>>> PING nfs1.lightspeed.ca (192.168.1.31) 56(84) bytes of data.
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=1 ttl=64 time=0.148 ms
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=2 ttl=64 time=0.126 ms
>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=3 ttl=64 time=0.133 ms
>>>
>>> --- nfs1.lightspeed.ca ping statistics ---
>>> 3 packets transmitted, 3 received, 0% packet loss, time 1998ms
>>> rtt min/avg/max/mdev = 0.126/0.135/0.148/0.016 ms
>>>
>>> I can get gluster to probe the hostname:
>>>
>>> root@nfs3:/home/ernied# gluster peer probe nfs1
>>> peer probe: success. Host nfs1 port 24007 already in peer list
>>>
>>> But if I try to create the brick on the new node, it says that the
>>> host can't be found? Um...
>>>
>>> root@nfs3:/home/ernied# gluster volume create gv2 replica 3
>>> nfs1.lightspeed.ca:/brick1/gv2/ nfs2.lightspeed.ca:/brick1/gv2/
>>> nfs3.lightspeed.ca:/brick1/gv2
>>> volume create: gv2: failed: Failed to find host nfs1.lightspeed.ca
>>>
>>> Our logs from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
>>>
>>> [2016-04-06 18:19:18.107459] E [MSGID: 106452]
>>> [glusterd-utils.c:5825:glusterd_new_brick_validate] 0-management:
>>> Failed to find host nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.107496] E [MSGID: 106536]
>>> [glusterd-volume-ops.c:1364:glusterd_op_stage_create_volume]
>>> 0-management: Failed to find host nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.107516] E [MSGID: 106301]
>>> [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of
>>> operation 'Volume Create' failed on localhost : Failed to find host
>>> nfs1.lightspeed.ca
>>> [2016-04-06 18:19:18.231864] E [MSGID: 106170]
>>> [glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
>>> Request from peer 192.168.1.31:65530 has an entry in peerinfo, but
>>> uuid does not match
>
> We have introduced a new check to reject a peer if the request is coming
> from a node where the hostname matches but UUID is different. This can
> happen if a node goes through a re-installation and its
> /var/lib/glusterd/* content is wiped off. Look at [1] for more details.
>
> [1] http://review.gluster.org/13519
>
> Do confirm if that's the case.

I couldn't say if that's *exactly* the case, but it's pretty close. I
don't recall ever removing /var/lib/glusterd/* or any of its contents,
but the operating system isn't exactly the way it was when I first tried
to add this node to the cluster.

What should I do to *fix* the problem though, so I can add this node to
the cluster? This bug report doesn't appear to provide a solution. I've
tried removing the node from the cluster, and that failed too.
Things seem to be in a very screwy state right now.

>>> [2016-04-06 18:19:18.231919] E [MSGID: 106170]
>>> [glusterd-handshake.c:1060:gd_validate_mgmt_hndsk_req] 0-management:
>>> Rejecting management handshake request from unknown peer
>>> 192.168.1.31:65530
>>>
>>> That error about the entry in peerinfo doesn't match anything in
>>> Google besides the source code for Gluster. My guess is that my
>>> earlier unsuccessful attempts to add this node before v3.7.10 have
>>> created a conflict that needs to be cleared.
>>
>> More interesting is what happens when I try to add the third server to
>> the brick from the first gluster server:
>>
>> root@nfs1:/home/ernied# gluster volume add-brick gv2 replica 3
>> nfs3:/brick1/gv2
>> volume add-brick: failed: One or more nodes do not support the required
>> op-version. Cluster op-version must atleast be 30600.
>>
>> Yet, when I view the operating version in /var/lib/glusterd/glusterd.info:
>>
>> root@nfs1:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=1207917a-23bc-4bae-8238-cd691b7082c7
>> operating-version=30501
>>
>> root@nfs2:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=e394fcec-41da-482a-9b30-089f717c5c06
>> operating-version=30501
>>
>> root@nfs3:/home/ernied# cat /var/lib/glusterd/glusterd.info
>> UUID=ae191e96-9cd6-4e2b-acae-18f2cc45e6ed
>> operating-version=30501
>>
>> I see that the operating version is the same on all nodes!
>
> Here the cluster op-version is pretty old. You need to make sure that
> you bump up the op-version with 'gluster volume set all
> cluster.op-version 30710'. The add-brick code path has a check that your
> cluster op-version has to be at least 30600 if you are on a gluster
> version >= 3.6, which is the case here.
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
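Atin's op-version advice above can be written out as shell commands. This is only a sketch based on the thread: the glusterd.info path and the `gluster volume set` invocation are the ones shown in the messages, and 30710 matches the gluster 3.7.10 release being discussed; adjust the target for your own version.

```shell
# On each node (nfs1, nfs2, nfs3), check the op-version glusterd has
# recorded on disk; the thread shows all three nodes reporting 30501:
grep '^operating-version' /var/lib/glusterd/glusterd.info

# From any one node, raise the cluster-wide op-version. add-brick
# refuses to run while this value is below 30600:
gluster volume set all cluster.op-version 30710
```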
Atin Mukherjee
2016-Apr-07 16:16 UTC
[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.
-Atin
Sent from one plus one

On 07-Apr-2016 9:32 pm, "Ernie Dunbar" <maillist at lightspeed.ca> wrote:
>
> [...]
> I couldn't say if that's *exactly* the case, but it's pretty close. I
> don't recall ever removing /var/lib/glusterd/* or any of its contents,
> but the operating system isn't exactly the way it was when I first
> tried to add this node to the cluster.
>
> What should I do to *fix* the problem though, so I can add this node to
> the cluster? This bug report doesn't appear to provide a solution. I've
> tried removing the node from the cluster, and that failed too. Things
> seem to be in a very screwy state right now.
I should have given the workaround earlier. Find the peer file for the
faulty node in /var/lib/glusterd/peers/ and delete it on every node
except the faulty one, then restart the glusterd instance on all of
those nodes. On the faulty node itself, make sure the /var/lib/glusterd/
content is empty and restart glusterd, then peer probe this node from
any node in the existing cluster. You should also bump up the op-version
once the cluster is stable.
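A concrete sketch of these steps follows. It is hypothetical, not from the thread: `<stale-uuid>` stands in for the file name of the faulty node's old peer entry (peer files under /var/lib/glusterd/peers/ are named by UUID and are assumed to record the peer's hostname, as they do in gluster 3.x), and the glusterd service name varies by distribution.

```shell
# On every node EXCEPT the faulty one (nfs1 and nfs2 in this thread):
grep -l 'nfs3.lightspeed.ca' /var/lib/glusterd/peers/*  # locate the stale entry
rm /var/lib/glusterd/peers/<stale-uuid>                 # the file found above
service glusterd restart   # service name is glusterfs-server on Debian/Ubuntu

# On the faulty node (nfs3), wipe glusterd's state and restart.
# WARNING: this removes the node's old UUID and all peer/volume metadata.
rm -rf /var/lib/glusterd/*
service glusterd restart

# From any node already in the cluster, probe the faulty node again:
gluster peer probe nfs3.lightspeed.ca

# Once the cluster is healthy, bump the op-version as suggested in the thread:
gluster volume set all cluster.op-version 30710
```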
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users