Please send us the glusterd log file along with cmd_history.log from all
the 6 nodes. The logs you mentioned in the thread are not relevant to
debug the issue. Which gluster version are you using?

~Atin

On 06/13/2016 06:49 PM, Arif Ali wrote:
> Hi all,
>
> Hopefully, someone can help.
>
> We have a 6-node gluster setup, and have successfully got the gluster
> system up and running, with no issues during the initial install.
>
> For other reasons, we had to re-provision the nodes, and therefore we
> had to go through some recovery steps to get each node back into the
> system. The documentation I used was [1].
>
> The key thing is that everything in the documentation worked without a
> problem. Gluster replication works, and we can easily monitor it
> through the heal commands.
>
> Unfortunately, we are not able to run "gluster volume status", which
> hangs for a moment and in the end returns "Error : Request timed out".
> Most of the log files are clean, except for
> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some
> of the contents:
>
> [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
> [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
>
> Any assistance on this would be much appreciated.
> [1] https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
>
> --
> Arif Ali
>
> IRC: arif-ali at freenode
> LinkedIn: http://uk.linkedin.com/in/arifali
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
Hi Atin,

I have sent the tar file of the logs in a PM.

The version of gluster that we have been using is:

# rpm -qa | grep gluster
glusterfs-api-3.7.11-1.el7.x86_64
glusterfs-geo-replication-3.7.11-1.el7.x86_64
glusterfs-libs-3.7.11-1.el7.x86_64
glusterfs-client-xlators-3.7.11-1.el7.x86_64
glusterfs-fuse-3.7.11-1.el7.x86_64
glusterfs-server-3.7.11-1.el7.x86_64
glusterfs-3.7.11-1.el7.x86_64
glusterfs-cli-3.7.11-1.el7.x86_64

--
Arif Ali

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali

On 13 June 2016 at 15:46, Atin Mukherjee <amukherj at redhat.com> wrote:
> Please send us the glusterd log file along with cmd_history.log from
> all the 6 nodes. The logs you mentioned in the thread are not relevant
> to debug the issue. Which gluster version are you using?
>
> ~Atin
>
> [...]
So the issue looks like an incorrect UUID got populated in the peer
configuration, which led to this inconsistency, and here is the log entry
that proves it. I have a feeling that the steps were not performed
properly, or that you missed copying the old UUID of the failed node to
the new one:

[2016-06-13 18:25:09.738363] E [MSGID: 106170]
[glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
Request from peer 10.28.9.12:65299 has an entry in peerinfo, but uuid
does not match

To get rid of this situation, you'd need to stop all the running glusterd
instances, go into the /var/lib/glusterd/peers folder on all the nodes,
and manually correct the UUID file names and their content if required.

Just to give you an idea of how the peer configurations are structured
and stored, here is an example.

On a 3-node cluster (say N1, N2, N3):

N1's UUID - dc07f77f-09f3-46f4-8d92-f2d7f6e627af
  (from 'grep UUID /var/lib/glusterd/glusterd.info' on N1)
N2's UUID - 02d157bd-a738-4914-991e-60953409f1b1
N3's UUID - 932186a6-4b29-4216-8da1-2fe193c928c1

N1's peer configuration
=======================
root at ebbc696b4dc4:/home/glusterfs# cd /var/lib/glusterd/peers/
root at ebbc696b4dc4:/var/lib/glusterd/peers# ls -lrt
total 8
-rw------- 1 root root 71 Jun 15 05:01 02d157bd-a738-4914-991e-60953409f1b1  -----> N2's UUID
-rw------- 1 root root 71 Jun 15 05:02 932186a6-4b29-4216-8da1-2fe193c928c1  -----> N3's UUID

Content of the other peers (N2, N3) on N1's disk
================================================
root at ebbc696b4dc4:/var/lib/glusterd/peers# cat 02d157bd-a738-4914-991e-60953409f1b1
uuid=02d157bd-a738-4914-991e-60953409f1b1
state=3
hostname1=172.17.0.3
root at ebbc696b4dc4:/var/lib/glusterd/peers# cat 932186a6-4b29-4216-8da1-2fe193c928c1
uuid=932186a6-4b29-4216-8da1-2fe193c928c1
state=3
hostname1=172.17.0.4

Similarly, you will find the details of N1 and N2 on N3, and of N1 and N3
on N2. You'd need to validate this theory on all the nodes, correct the
content, and remove the unwanted UUIDs.
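The layout above lends itself to a quick scripted sanity check. The sketch below is not an official gluster tool (the `check_peers_dir` function name is my own); it flags any file under a peers directory whose name does not match the uuid= line recorded inside it, which is exactly the kind of inconsistency described above. Run it per node, with glusterd stopped:

```shell
# check_peers_dir DIR: print a MISMATCH line for every peer file whose
# file name differs from the uuid= line it contains; returns nonzero if
# any mismatch was found. Sketch only, assuming the peer file format
# shown above (uuid=/state=/hostname1= lines).
check_peers_dir() {
  dir=$1
  rc=0
  for f in "$dir"/*; do
    [ -e "$f" ] || continue                  # empty dir: nothing to check
    name=$(basename "$f")
    uuid=$(sed -n 's/^uuid=//p' "$f")        # extract the uuid= value
    if [ "$name" != "$uuid" ]; then
      echo "MISMATCH: $f contains uuid=$uuid"
      rc=1
    fi
  done
  return $rc
}

# On a real node you would run:
# check_peers_dir /var/lib/glusterd/peers
```

A clean run prints nothing and returns 0; any output tells you which file to rename or fix before restarting glusterd.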
Post that, restarting all the glusterd instances should solve the
problem.

HTH,
Atin

On 06/13/2016 08:16 PM, Atin Mukherjee wrote:
> Please send us the glusterd log file along with cmd_history.log from
> all the 6 nodes. The logs you mentioned in the thread are not relevant
> to debug the issue. Which gluster version are you using?
>
> ~Atin
>
> [...]