Hi,

I have a 3-node replica (including arbiter) volume with GlusterFS 3.8.11, and last night one of my nodes (node1) ran out of memory for some unknown reason, so the Linux OOM killer killed the glusterd and glusterfs processes. I restarted the glusterd process, but now that node is in "Peer Rejected" state from the other nodes, and from its own side it rejects the two other nodes, as you can see below from the output of "gluster peer status":

Number of Peers: 2

Hostname: arbiternode.domain.tld
Uuid: 60a03a81-ba92-4b84-90fe-7b6e35a10975
State: Peer Rejected (Connected)

Hostname: node2.domain.tld
Uuid: 4834dceb-4356-4efb-ad8d-8baba44b967c
State: Peer Rejected (Connected)

I also rebooted node1 just in case, but that did not help.

I read at http://www.spinics.net/lists/gluster-users/msg25803.html that the problem could have something to do with the volume info file, so I checked the file:

/var/lib/glusterd/vols/myvolume/info

It is the same on node1 and arbiternode, but on node2 the order of the following volume parameters is different:

features.quota-deem-statfs=on
features.inode-quota=on
nfs.disable=on
performance.readdir-ahead=on

Could that be the reason why the peer is in rejected state? Can I simply edit this file on node2 to re-order the parameters like on the other two nodes?

What else should I do to investigate the reason for this rejected peer state?

Thank you in advance for the help.

Best,
Mabi
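(A quick way to compare the volume metadata across peers, as a minimal sketch only: it assumes the hostnames node1, node2 and arbiternode from this thread and working ssh between them.)

  # compare checksums of the volume info file on all three peers
  for h in node1.domain.tld node2.domain.tld arbiternode.domain.tld; do
      echo -n "$h: "
      ssh "$h" md5sum /var/lib/glusterd/vols/myvolume/info
  done

  # show the exact differences between two peers
  diff <(ssh node1.domain.tld cat /var/lib/glusterd/vols/myvolume/info) \
       <(ssh node2.domain.tld cat /var/lib/glusterd/vols/myvolume/info)

Note that glusterd compares its own checksums of the volume configuration between peers (as the "Cksums ... differ" log lines later in this thread show), so hand-editing the info file to re-order lines is generally not the safest fix; syncing the whole volume directory from a known-good peer, as suggested in the reply below, is the more common approach.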
On 2017-08-06 15:59, mabi wrote:
> [...]
> What else should I do to investigate the reason for this rejected peer
> state?

Hi mabi.

In my opinion, it is caused by some volfile/checksum mismatch. Try looking at the glusterd log file (/var/log/glusterfs/glusterd.log) on the REJECTED node and find log entries like the ones below:

[2014-06-17 04:21:11.266398] I [glusterd-handler.c:2050:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 81857e74-a726-4f48-8d1b-c2a4bdbc094f
[2014-06-17 04:21:11.266485] E [glusterd-utils.c:2373:glusterd_compare_friend_volume] 0-management: Cksums of volume supportgfs differ. local cksum = 52468988, remote cksum = 2201279699 on peer 172.26.178.254
[2014-06-17 04:21:11.266542] I [glusterd-handler.c:3085:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 172.26.178.254 (0), ret: 0
[2014-06-17 04:21:11.272206] I [glusterd-rpc-ops.c:356:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 81857e74-a726-4f48-8d1b-c2a4bdbc094f, host: 172.26.178.254, port: 0

If that is the case, you need to sync the volfile files/directories under /var/lib/glusterd/vols/<VOLNAME> from one of the GOOD nodes.

For details on how to resolve this problem, please show more information such as the glusterd log :)

--
Best regards.

--
Ji-Hyeon Gim
Research Engineer, Gluesys

Address. Gluesys R&D Center, 5F, 11-31, Simin-daero 327beon-gil,
         Dongan-gu, Anyang-si, Gyeonggi-do, Korea (14055)
Phone.   +82-70-8787-1053
Fax.     +82-31-388-3261
Mobile.  +82-10-7293-8858
E-Mail.  potatogim at potatogim.net
Website. www.potatogim.net

The time I wasted today is the tomorrow the dead man was eager to see yesterday.
- Sophocles
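(As a rough illustration of that sync step, a minimal sketch only: it assumes the volume name "myvolume" from this thread, that node2 is a known-good peer, and that glusterd is managed by systemd; adjust hostnames and service commands to your distribution.)

  # on the rejected node (node1)
  systemctl stop glusterd

  # keep a backup of the current volume configuration before overwriting it
  cp -a /var/lib/glusterd/vols/myvolume /var/lib/glusterd/vols/myvolume.bak.$(date +%F)

  # pull the volume directory from a good peer
  rsync -av --delete node2.domain.tld:/var/lib/glusterd/vols/myvolume/ \
        /var/lib/glusterd/vols/myvolume/

  systemctl start glusterd
  gluster peer status

Do not copy /var/lib/glusterd/glusterd.info between nodes; it holds each peer's own UUID.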
Hi Ji-Hyeon,

Thanks to your help I could find the problematic file. It is the quota file of my volume: it has a different checksum on node1, whereas node2 and arbiternode have the same checksum. This is expected, as I had issues with my quota file and had to fix it manually with a script (more details on this mailing list in a previous post), and I only did that on node1.

So what I did now is copy the /var/lib/glusterd/vols/myvolume/quota.conf file from node1 to node2 and arbiternode and then restart the glusterd process on node1, but somehow this did not fix the issue. I suppose I am missing a step here; maybe you have an idea what it is?

Here is the relevant part of my glusterd.log file taken from node1:

[2017-08-06 08:16:57.699131] E [MSGID: 106012] [glusterd-utils.c:2988:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvolume differ. local cksum = 3823389269, remote cksum = 733515336 on peer node2.domain.tld
[2017-08-06 08:16:57.275558] E [MSGID: 106012] [glusterd-utils.c:2988:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvolume differ. local cksum = 3823389269, remote cksum = 733515336 on peer arbiternode.intra.oriented.ch

Best regards,
Mabi

> -------- Original Message --------
> Subject: Re: [Gluster-users] State: Peer Rejected (Connected)
> Local Time: August 6, 2017 9:31 AM
> UTC Time: August 6, 2017 7:31 AM
> From: potatogim at potatogim.net
> To: mabi <mabi at protonmail.ch>
> Gluster Users <gluster-users at gluster.org>
>
> [...]
>
> In my opinion, it is caused by some volfile/checksum mismatch. Try looking
> at the glusterd log file (/var/log/glusterfs/glusterd.log) on the REJECTED
> node and find log entries like the ones below:
>
> [...]
>
> If that is the case, you need to sync the volfile files/directories under
> /var/lib/glusterd/vols/<VOLNAME> from one of the GOOD nodes.
>
> For details on how to resolve this problem, please show more information
> such as the glusterd log :)
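(A follow-up sketch that may help at this point. Assumptions: the paths and hostnames from this thread, working ssh between the nodes, and that your GlusterFS version keeps a cached quota.cksum file next to quota.conf — check whether it exists on your nodes before relying on it.)

  # verify that quota.conf is now byte-identical on all three peers
  for h in node1.domain.tld node2.domain.tld arbiternode.domain.tld; do
      echo -n "$h: "
      ssh "$h" md5sum /var/lib/glusterd/vols/myvolume/quota.conf
  done

  # if a cached checksum file exists, compare it as well; a stale copy here
  # can keep the "Cksums of quota configuration ... differ" error alive
  for h in node1.domain.tld node2.domain.tld arbiternode.domain.tld; do
      echo -n "$h: "
      ssh "$h" 'cat /var/lib/glusterd/vols/myvolume/quota.cksum 2>/dev/null || echo "no quota.cksum"'
  done

  # then restart glusterd on every peer (not only node1) so the peers
  # re-run the friend handshake and re-compare their configuration checksums
  systemctl restart glusterd

Since glusterd was only restarted on node1 in the message above, restarting it on node2 and arbiternode as well is probably the first thing to try.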