Please send us the glusterd log file along with cmd_history.log from all
the 6 nodes. The logs you mentioned in the thread are not relevant to
debug the issue. Which gluster version are you using?

~Atin

On 06/13/2016 06:49 PM, Arif Ali wrote:
> Hi all,
>
> Hopefully, someone can help.
>
> We have a 6-node gluster setup, and have successfully got the gluster
> system up and running, with no issues during the initial install.
>
> For other reasons, we had to re-provision the nodes, and therefore we
> had to go through some recovery steps to get each node back into the
> system. The documentation I used was [1].
>
> The key thing is that everything in the documentation worked without a
> problem. Gluster replication works, and we can easily monitor it
> through the heal commands.
>
> Unfortunately, we are not able to run "gluster volume status", which
> hangs for a moment and in the end returns "Error : Request timed out".
> Most of the log files are clean, except for
> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some
> of the contents:
>
> [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
> [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
>
> Any assistance on this would be much appreciated.
> [1] https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
>
> --
> Arif Ali
>
> IRC: arif-ali at freenode
> LinkedIn: http://uk.linkedin.com/in/arifali
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
Hi Atin,

I have sent the tar file of the logs in a PM.

The version of gluster that we have been using is:

# rpm -qa | grep gluster
glusterfs-api-3.7.11-1.el7.x86_64
glusterfs-geo-replication-3.7.11-1.el7.x86_64
glusterfs-libs-3.7.11-1.el7.x86_64
glusterfs-client-xlators-3.7.11-1.el7.x86_64
glusterfs-fuse-3.7.11-1.el7.x86_64
glusterfs-server-3.7.11-1.el7.x86_64
glusterfs-3.7.11-1.el7.x86_64
glusterfs-cli-3.7.11-1.el7.x86_64

--
Arif Ali

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali

On 13 June 2016 at 15:46, Atin Mukherjee <amukherj at redhat.com> wrote:
> Please send us the glusterd log file along with cmd_history.log from
> all the 6 nodes. The logs you mentioned in the thread are not relevant
> to debug the issue. Which gluster version are you using?
>
> ~Atin
>
> [...]
So the issue looks like an incorrect UUID got populated in the peer
configuration, which led to this inconsistency, and here is the log entry
that proves it. I have a feeling that the steps were not performed
properly, or that you missed copying the old UUID of the failed node to
the new one:

[2016-06-13 18:25:09.738363] E [MSGID: 106170]
[glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
Request from peer 10.28.9.12:65299 has an entry in peerinfo, but uuid
does not match

To get rid of this situation, you'd need to stop all the running glusterd
instances, go into the /var/lib/glusterd/peers folder on all the nodes,
and manually correct the UUID file names and their content if required.

Just to give you an idea of how the peer configurations are structured
and stored, here is an example.

On a 3-node cluster (say N1, N2, N3):

N1's UUID - dc07f77f-09f3-46f4-8d92-f2d7f6e627af
  (from 'grep UUID /var/lib/glusterd/glusterd.info' on N1)
N2's UUID - 02d157bd-a738-4914-991e-60953409f1b1
N3's UUID - 932186a6-4b29-4216-8da1-2fe193c928c1

N1's peer configuration
=======================
root at ebbc696b4dc4:/home/glusterfs# cd /var/lib/glusterd/peers/
root at ebbc696b4dc4:/var/lib/glusterd/peers# ls -lrt
total 8
-rw------- 1 root root 71 Jun 15 05:01 02d157bd-a738-4914-991e-60953409f1b1  -----> N2's UUID
-rw------- 1 root root 71 Jun 15 05:02 932186a6-4b29-4216-8da1-2fe193c928c1  -----> N3's UUID

Content of the other peers (N2, N3) on N1's disk
================================================
root at ebbc696b4dc4:/var/lib/glusterd/peers# cat 02d157bd-a738-4914-991e-60953409f1b1
uuid=02d157bd-a738-4914-991e-60953409f1b1
state=3
hostname1=172.17.0.3
root at ebbc696b4dc4:/var/lib/glusterd/peers# cat 932186a6-4b29-4216-8da1-2fe193c928c1
uuid=932186a6-4b29-4216-8da1-2fe193c928c1
state=3
hostname1=172.17.0.4

Similarly, you will find the details of N1 and N2 on N3, and of N1 and N3
on N2. You'd need to validate this theory on all the nodes, correct the
content, and remove the unwanted UUIDs.
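The layout above lends itself to a quick scripted sanity check. The sketch below is not an official gluster tool (the `check_peers_dir` function name is my own); it flags any file under a peers directory whose name does not match the uuid= line recorded inside it, which is exactly the kind of inconsistency described above. Run it per node, with glusterd stopped:

```shell
# check_peers_dir DIR: print a MISMATCH line for every peer file whose
# file name differs from the uuid= line it contains; returns nonzero if
# any mismatch was found. Sketch only, assuming the peer file format
# shown above (uuid=/state=/hostname1= lines).
check_peers_dir() {
  dir=$1
  rc=0
  for f in "$dir"/*; do
    [ -e "$f" ] || continue                  # empty dir: nothing to check
    name=$(basename "$f")
    uuid=$(sed -n 's/^uuid=//p' "$f")        # extract the uuid= value
    if [ "$name" != "$uuid" ]; then
      echo "MISMATCH: $f contains uuid=$uuid"
      rc=1
    fi
  done
  return $rc
}

# On a real node you would run:
# check_peers_dir /var/lib/glusterd/peers
```

A clean run prints nothing and returns 0; any output tells you which file to rename or fix before restarting glusterd.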
Post that, restarting all the glusterd instances should solve the
problem.

HTH,
Atin

On 06/13/2016 08:16 PM, Atin Mukherjee wrote:
> Please send us the glusterd log file along with cmd_history.log from
> all the 6 nodes. The logs you mentioned in the thread are not relevant
> to debug the issue. Which gluster version are you using?
>
> ~Atin
>
> [...]