On Fri, Feb 12, 2016 at 8:54 AM, Mike Stump <mikestump at comcast.net> wrote:
> So, I lost one of my servers and the OS was reinstalled. The gluster data
> is on another disk that survives OS reinstalls. /var/lib/glusterd, however,
> does not.
>
> I was following the bring-it-back-up directions, but before I did that, I
> think a peer probe was done with the new UUID. This caused it to be dropped
> from the cluster entirely.
>
> I edited the UUID back to what it was, but now it is no longer in the
> cluster. The web site didn't seem to have any help for how to undo the
> drop. It was part of a replica 2 pair, and I would like it to merely come
> back up and be a part of the cluster again. It has all the data (as I run
> with quorum, and all the replica 2 pair contents are R/O until this server
> comes back). I don't mind letting it refresh from the other member of the
> replica pair, even though the data is already on disk.
>
> I tried:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Host machine04 is not in 'Peer in Cluster' state
>
> to try and let it resync into the cluster, but it won't let me replace the
> brick. I can't do:
>
> # gluster peer detach machine04
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> either. What I wanted it to do is this: when it connected to the cluster
> the first time with the new UUID, the cluster should inform it that it
> might have filesystems on it (it comes in with a name already in the peer
> list), get the brick information from the cluster, and check it out. If it
> has those bricks, it should just notice the UUID is wrong, fix it, make it
> part of the cluster again, spin it all up, and continue on.
>
> I tried:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Volume g2home does not exist
>
> and it didn't work on either machine04 or one of the peers:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Operation failed
>
> So, to try and fix the 'Peer in Cluster' issue, I stopped and restarted
> glusterd many times, and eventually almost all of the peers came back up
> into the 'Peer in Cluster' state. All except for one that was endlessly
> confused. So, if the network works, it should wipe the peer state and just
> retry the entire state machine to get back into the right state. I had to
> stop glusterd on the two machines and then manually edit the state to be
> 3, and then restart them. It then at least showed the right state on both.
>
> Next, let's try and sync up the bricks:
>
> root at machine04:/# gluster volume sync machine00 all
> Sync volume may make data inaccessible while the sync is in progress. Do
> you want to continue? (y/n) y
> volume sync: success
> root at machine04:/# gluster vol info
> No volumes present
>
> root at machine02:/# gluster volume heal g2home full
> Staging failed on machine04. Error: Volume g2home does not exist
>
> Think about that. This is a replica 2 server; the entire point would be to
> fix up the array if one of the machines was screwy. heal seemed like the
> command to fix it up.
>
> So, now that it is connected, let's try this again:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Pre Validation failed on machine04. volume:
> g2home does not exist
>
> Nope, that won't work. So, let's try removing:
>
> # gluster vol remove-brick g2home replica 2 machine04:/.g/g2home machine05:/.g/g2home start
> volume remove-brick start: failed: Staging failed on machine04. Please
> check log file for details.
>
> Nope, that won't work either. What's the point of remove-brick, if it won't work?
>
> Ok, fine, let's go for a bigger hammer:
>
> # gluster peer detach machine04 force
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> Doh. I know that, but it is a replica!
>
> [ more googling ]
>
> Someone said to just copy the entire vols directory. [ cross fingers ] I
> copied vols.
>
> Ok, I can now do a gluster volume status g2home detail, which I could not
> do before. Files seem to be R/W on the array now. I think that might have
> worked.
>
> So, why can't gluster copy vols by itself, if indeed that is the right
> thing to do?
Gluster should actually do that, provided the peer is in the 'Peer in
Cluster' state.
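As a quick sanity check before retrying any volume commands, you can filter the
output of `gluster peer status` for peers that are not in that state. A minimal
sketch; the Hostname:/State: line format here is taken from gluster 3.x output
and may differ in other releases:

```shell
# Print any peer whose reported state is not 'Peer in Cluster'.
# Assumes `gluster peer status` emits "Hostname: <name>" and
# "State: <state> (<connection>)" lines, as gluster 3.x does.
check_peers() {
    awk -F': ' '/^Hostname:/ { host = $2 }
                /^State:/ && $2 !~ /^Peer in Cluster/ { print host " -> " $2 }'
}

# Usage: gluster peer status | check_peers
# No output means every peer reports the expected state.
```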
>
> Why can't the documentation say to just edit the state variable and copy
> vols to get it going again?
Which document did you refer to? I'm not aware of a document that
describes how to recover a peer after the loss of /var/lib/glusterd.
The following steps should have helped you get the cluster back into a
good state quickly.

On the newly reinstalled peer, before starting glusterd:

1. Create the /var/lib/glusterd/glusterd.info file and fill it with the
peer's previous UUID and operating-version. The UUID can be obtained from
the peerinfo files in /var/lib/glusterd/peers on the other peers, and the
operating-version from glusterd.info on the other peers.
2. From one of the other peers, copy over /var/lib/glusterd/peers, and
then remove the peerinfo file for this peer itself. This should allow
glusterd on this peer to accept connections from the rest of the cluster
and also connect to the rest of the cluster.
3. Start glusterd.
4. The remaining information on volumes and other peers should be
synced over automatically, and the bricks and other daemons should
start running.
(We should probably write this down somewhere.)
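Steps 1 and 2 above can be sketched in shell roughly as follows. This is only
an illustration: the glusterd.info format (UUID= and operating-version= lines)
matches what glusterd writes today, but the hostname machine00, the use of scp,
and the helper names are assumptions you would adapt to your setup:

```shell
#!/bin/sh
# Sketch of steps 1-2 for a reinstalled peer; run before starting glusterd.

GLUSTERD_DIR=${GLUSTERD_DIR:-/var/lib/glusterd}

# Step 1: recreate glusterd.info with the peer's old UUID and the
# cluster's operating-version (both read off a surviving peer).
write_glusterd_info() {
    uuid=$1; opver=$2
    mkdir -p "$GLUSTERD_DIR"
    printf 'UUID=%s\noperating-version=%s\n' "$uuid" "$opver" \
        > "$GLUSTERD_DIR/glusterd.info"
}

# Step 2: copy the peers directory from a healthy node, then drop the
# peerinfo file describing this peer itself (named after its own UUID).
fetch_peers() {
    good_peer=$1; my_uuid=$2
    scp -r "root@$good_peer:/var/lib/glusterd/peers" "$GLUSTERD_DIR/"
    rm -f "$GLUSTERD_DIR/peers/$my_uuid"
}

# Usage (values are placeholders to fill in from a surviving node):
#   write_glusterd_info <old-uuid> <operating-version>
#   fetch_peers machine00 <old-uuid>
#   systemctl start glusterd    # step 3
```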
>
> Why can't probe figure out that you were already part of a cluster, and
> when it runs, notice that your brains have been wiped, and just grab that
> info from the cluster and bring the node back up? It could even run heal
> on the data to ensure that nothing messed with it and that it matches the
> other replica.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users