I have three machines, all Ubuntu 12.04 running gluster 3.3.1.

  storage1   192.168.6.70 on 10G, 192.168.5.70 on 1G
  storage2   192.168.6.71 on 10G, 192.168.5.71 on 1G
  storage3   192.168.6.72 on 10G, 192.168.5.72 on 1G

Each machine has two NICs, but on each host /etc/hosts lists the 10G
interface of all machines.

storage1 and storage3 were taken away for hardware changes, which included
swapping the boot disks, and had the O/S reinstalled.  Somehow I have gotten
into a state where "gluster peer status" is broken.

[on storage1]
# gluster peer status
(just hangs here until I press ^C)

[on storage2]
# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: bf320f69-2713-4b57-9003-a721a8101bc6
State: Peer in Cluster (Connected)

Hostname: storage3
Uuid: 1b058f9f-c116-496f-8b50-fb581f9625f0
State: Peer Rejected (Connected)        << note "Rejected"

[on storage3]
# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: 698ee46d-ab8c-45f6-a6b6-7af998430a37
State: Peer in Cluster (Connected)

Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer Rejected (Connected)        << note "Rejected"

Poking around the filesystem a bit:

[on storage1]
root@storage1:~# cat /var/lib/glusterd/glusterd.info
UUID=bf320f69-2713-4b57-9003-a721a8101bc6
root@storage1:~# ls /var/lib/glusterd/peers/
2c0670f4-c3ba-46e0-92a8-108e71832b59
root@storage1:~# head /var/lib/glusterd/peers/*
uuid=2c0670f4-c3ba-46e0-92a8-108e71832b59
state=4
hostname1=storage2

[on storage2]
# cat /var/lib/glusterd/glusterd.info
UUID=2c0670f4-c3ba-46e0-92a8-108e71832b59
# head /var/lib/glusterd/peers/*
==> /var/lib/glusterd/peers/1b058f9f-c116-496f-8b50-fb581f9625f0 <==
uuid=1b058f9f-c116-496f-8b50-fb581f9625f0
state=6
hostname1=storage3

==> /var/lib/glusterd/peers/698ee46d-ab8c-45f6-a6b6-7af998430a37 <==
uuid=bf320f69-2713-4b57-9003-a721a8101bc6
state=3
hostname1=192.168.6.70

[on storage3]
# cat /var/lib/glusterd/glusterd.info
UUID=1b058f9f-c116-496f-8b50-fb581f9625f0
# head /var/lib/glusterd/peers/*
==> /var/lib/glusterd/peers/2c0670f4-c3ba-46e0-92a8-108e71832b59 <==
uuid=2c0670f4-c3ba-46e0-92a8-108e71832b59
state=6
hostname1=storage2

==> /var/lib/glusterd/peers/698ee46d-ab8c-45f6-a6b6-7af998430a37 <==
uuid=698ee46d-ab8c-45f6-a6b6-7af998430a37
state=3
hostname1=192.168.6.70

Obvious problems:

- storage1 is known to its peers by IP address, not by hostname
- storage3 has the wrong UUID for storage1
- storage2 and storage3 are failing to be peers ("Peer Rejected", whatever
  that means), although I do have clients accessing data on a volume on
  storage2 and a volume on storage3

On storage1, typing "gluster peer detach storage2" or "gluster peer detach
storage3" just hangs.  Detaching storage1 from the other side fails:

root@storage2:~# gluster peer detach storage1
One of the peers is probably down. Check with 'peer status'.
root@storage2:~# gluster peer detach 192.168.6.70
One of the peers is probably down. Check with 'peer status'.
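(Side note for anyone comparing their own cluster the same way: the
inspection above is just the local UUID from glusterd.info plus the stored
peer records, repeated on every node.  A throwaway loop like the following
does the same thing in one go; the hostnames and root ssh access are
obviously specific to my setup:

    for h in storage1 storage2 storage3; do
        echo "=== $h ==="
        ssh root@$h 'cat /var/lib/glusterd/glusterd.info; head -v /var/lib/glusterd/peers/*'
    done

Each file under /var/lib/glusterd/peers/ should be named after the UUID it
contains, and that UUID should match glusterd.info on the host it describes.
Judging from the outputs above, state=3 lines up with "Peer in Cluster" and
state=6 with "Peer Rejected".)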
Then I found something very suspicious on storage1:

root@storage1:~# tail /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2012-12-03 12:50:36.208029] I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-12-03 12:51:05.023553] I [glusterd-handler.c:1168:glusterd_handle_sync_volume] 0-glusterd: Received volume sync req for volume all
[2012-12-03 12:51:05.023741] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by bf320f69-2713-4b57-9003-a721a8101bc6
[2012-12-03 12:51:05.023761] I [glusterd-handler.c:463:glusterd_op_txn_begin] 0-management: Acquired local lock
[2012-12-03 12:51:05.024176] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
[2012-12-03 12:51:05.024214] C [glusterd-op-sm.c:1946:glusterd_op_build_payload] 0-management: volname is not present in operation ctx
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11

root@storage1:~# ps auxwww | grep gluster
root      1584  0.0  0.1 230516 10668 ?   Ssl  11:36   0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid
root      6466  0.0  0.0   9392   920 pts/0  S+  13:35   0:00 grep --color=auto gluster

Hmm... so as you can see, a SEGV (signal 11) was logged, yet glusterd was
still running.  After stopping and starting it, I was able to run
"gluster peer status" again:

root@storage1:~# service glusterfs-server stop
glusterfs-server stop/waiting
root@storage1:~# ps auxwww | grep gluster
root      6478  0.0  0.0   9388   920 pts/0  S+  13:36   0:00 grep --color=auto gluster
root@storage1:~# service glusterfs-server start
glusterfs-server start/running, process 6485
root@storage1:~# gluster peer status
Number of Peers: 1

Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer in Cluster (Connected)
root@storage1:~#

But I still cannot detach from either side.  From the storage1 side:

root@storage1:~# gluster peer status
Number of Peers: 1

Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer in Cluster (Connected)
root@storage1:~# gluster peer detach storage2
Brick(s) with the peer storage2 exist in cluster

From the storage2 side:

root@storage2:~# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: bf320f69-2713-4b57-9003-a721a8101bc6
State: Peer in Cluster (Connected)

Hostname: storage3
Uuid: 1b058f9f-c116-496f-8b50-fb581f9625f0
State: Peer Rejected (Connected)
root@storage2:~# gluster peer detach 192.168.6.70
One of the peers is probably down. Check with 'peer status'.
root@storage2:~# gluster peer detach storage1
One of the peers is probably down. Check with 'peer status'.

So this all looks broken, and as I can't find any gluster documentation
saying what these various states mean, I'm not sure how to proceed.  Any
suggestions?

Note: I have no replicated volumes, only distributed ones.

Thanks,

Brian.
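P.S. One part of this I can at least explain: the "Brick(s) with the peer
storage2 exist in cluster" refusal presumably means that a peer which still
provides bricks for a defined volume cannot simply be detached.  A crude way
to see which volumes reference which host is to grep the volume listing
(nothing gluster-specific here, just the "Volume Name:" and "BrickN:" lines
of the normal output):

    gluster volume info | grep -E '^(Volume Name|Brick[0-9]+):'

The "One of the peers is probably down" message from storage2, on the other
hand, I still can't account for.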
On Mon, Dec 03, 2012 at 01:44:47PM +0000, Brian Candler wrote:
> So this all looks broken, and as I can't find any gluster documentation
> saying what these various states mean, I'm not sure how to proceed. Any
> suggestions?

Update. On storage1 and storage3 I killed all glusterfs(d) processes, did

    rm /var/lib/glusterd/peers/*
    rm -rf /var/lib/glusterd/vols/*

and restarted glusterd.  Then I did "gluster peer probe storage2".

On the first attempt I was getting

    State: Accepted peer request (Connected)

and couldn't work out why it didn't move on to being a fully connected peer.
But after a detach and another probe, from storage3 I got

    State: Peer in Cluster (Connected)

which suggests it is OK.

However, "gluster volume info" on both shows that I have lost the volume I
had on storage3.  Trying to recreate it:

# gluster volume create scratch3 storage3:/disk/scratch/scratch3
/disk/scratch/scratch3 or a prefix of it is already part of a volume

Now I do remember seeing something about a script to remove xattrs, but I
can't find it in the Ubuntu glusterfs-{server,common,client,examples}
packages.  Back to the mailing list archives:

http://www.mail-archive.com/gluster-users@gluster.org/msg09013.html

So I did the two setfattr commands (reproduced in the P.S. below) and was
able to recreate the volume without loss of data.

storage1 was a bit more awkward:

root@storage1:/var/lib/glusterd# gluster peer status
No peers present
root@storage1:/var/lib/glusterd# gluster peer probe storage2
storage2 is already part of another cluster

<<Digs around source code>>
<<./xlators/mgmt/glusterd/src/glusterd-handler.c>>

OK, because storage2 already has a peer, it looks like I have to probe
storage1 from storage2, not the other way round.  It works this time.

So I think it's all working again now, but for someone who was not prepared
to experiment and get their hands dirty, it would have been a very hairy
experience.

I have to say that in my opinion, the two worst aspects of glusterfs by far
are:

- lack of error reporting, other than grubbing through log files on both
  client and server
- lack of documentation (especially recovery procedures for things like
  failed bricks, replacing bricks, volume info out of sync, split-brain
  data out of sync)

Unfortunately, live systems are not where you want to be experimenting :-(

Regards,

Brian.
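P.S. For the archives, since following the link is a bit of a treasure hunt:
as far as I recall, the fix from that thread boils down to removing the
volume-id and gfid xattrs from the old brick directory before re-creating
the volume, roughly like this (the brick path is mine; treat it as a sketch
and check the linked post before running it on data you care about):

    setfattr -x trusted.glusterfs.volume-id /disk/scratch/scratch3
    setfattr -x trusted.gfid /disk/scratch/scratch3

After that, "gluster volume create scratch3 storage3:/disk/scratch/scratch3"
succeeded and the existing data was still visible through the new volume.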