On Fri, Feb 12, 2016 at 8:54 AM, Mike Stump <mikestump at comcast.net> wrote:
> So, I lost one of my servers and the OS was reinstalled. The gluster data
> is on another disk that survives OS reinstalls. /var/lib/glusterd, however,
> does not.
>
> I was following the bring-it-back-up directions, but before I did that, I
> think a peer probe was done with the new UUID. This caused it to be dropped
> from the cluster entirely.
>
> I edited the UUID back to what it was, but now it is no longer in the
> cluster. The web site didn't seem to have any help for how to undo the
> drop. It was part of a replica 2 pair, and I would like it to merely come
> back up and be a part of the cluster again. It has all the data (as I run
> with quorum, and all the replica 2 pair contents are R/O until this server
> comes back). I don't mind letting it refresh from the other member of the
> replica pair, even though the data is already on disk.
>
> I tried:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Host machine04 is not in 'Peer in Cluster' state
>
> to try and let it resync into the cluster, but it won't let me replace the
> brick. I can't do:
>
> # gluster peer detach machine04
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> either. What I wanted it to do is this: when it connected to the cluster
> the first time with the new UUID, the cluster should inform it that it
> might have filesystems on it (it comes in with a name already in the peer
> list), get the brick information from the cluster, and check it out. If it
> has those bricks, it should just notice the UUID is wrong, fix it, make it
> part of the cluster again, spin it all up, and continue on.
>
> I tried:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Volume g2home does not exist
>
> and it didn't work on either machine04 or one of the peers:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Operation failed
>
> So, to try and fix the 'Peer in Cluster' issue, I stopped and restarted
> glusterd many times, and eventually almost all of the peers came back up
> into the 'Peer in Cluster' state. All except for one that was endlessly
> confused. So, if the network works, it should wipe the peer state and just
> retry the entire state machine to get back into the right state. I had to
> stop glusterd on the two machines and then manually edit the state to be
> 3, and then restart them. It then at least showed the right state on both.
>
> Next, let's try and sync up the bricks:
>
> root at machine04:/# gluster volume sync machine00 all
> Sync volume may make data inaccessible while the sync is in progress. Do
> you want to continue? (y/n) y
> volume sync: success
> root at machine04:/# gluster vol info
> No volumes present
>
> root at machine02:/# gluster volume heal g2home full
> Staging failed on machine04. Error: Volume g2home does not exist
>
> Think about that. This is a replica 2 server; the entire point would be to
> fix up the array if one of the machines was screwy. heal seemed like the
> command to fix it up.
>
> So, now that it is connected, let's try this again:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Pre Validation failed on machine04. volume:
> g2home does not exist
>
> Nope, that won't work. So, let's try removing:
>
> # gluster vol remove-brick g2home replica 2 machine04:/.g/g2home machine05:/.g/g2home start
> volume remove-brick start: failed: Staging failed on machine04. Please
> check log file for details.
>
> Nope, that won't work either. What's the point of remove-brick, if it won't work?
>
> Ok, fine, let's go for a bigger hammer:
>
> # gluster peer detach machine04 force
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> Doh. I know that, but it is a replica!
>
> [ more googling ]
>
> Someone said to just copy the entire vols directory. [ cross fingers ] I
> copied vols.
>
> Ok, I can now do a gluster volume status g2home detail, which I could not
> do before. Files seem to be R/W on the array now. I think that might have
> worked.
>
> So, why can't gluster copy vols by itself, if indeed that is the right
> thing to do?
Gluster should actually do that, provided the peer is in the 'Peer in
Cluster' state.
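As a quick sanity check before retrying any volume commands, you can filter the
output of `gluster peer status` for peers that are not in that state. A minimal
sketch; the Hostname:/State: line format here is taken from gluster 3.x output
and may differ in other releases:

```shell
# Print any peer whose reported state is not 'Peer in Cluster'.
# Assumes `gluster peer status` emits "Hostname: <name>" and
# "State: <state> (<connection>)" lines, as gluster 3.x does.
check_peers() {
    awk -F': ' '/^Hostname:/ { host = $2 }
                /^State:/ && $2 !~ /^Peer in Cluster/ { print host " -> " $2 }'
}

# Usage: gluster peer status | check_peers
# No output means every peer reports the expected state.
```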
>
> Why can't the documentation say to just edit the state variable and copy
> vols to get it going again?
Which document did you refer to? I'm not aware of a document that
describes how to recover a peer after the loss of /var/lib/glusterd.
The following steps should have helped you get the cluster back into a
good state quickly.

On the newly reinstalled peer, before starting glusterd:

1. Create the /var/lib/glusterd/glusterd.info file and fill it with the
peer's previous UUID and operating-version. The UUID can be obtained from
the peerinfo files in /var/lib/glusterd/peers on the other peers, and the
operating-version from glusterd.info on the other peers.
2. From one of the other peers, copy over /var/lib/glusterd/peers, and
then remove the peerinfo file for this peer itself. This should allow
glusterd on this peer to accept connections from the rest of the cluster
and also connect to the rest of the cluster.
3. Start glusterd.
4. The remaining information on volumes and other peers should be
synced over automatically, and the bricks and other daemons should
start running.
(We should probably write this down somewhere.)
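Steps 1 and 2 above can be sketched in shell roughly as follows. This is only
an illustration: the glusterd.info format (UUID= and operating-version= lines)
matches what glusterd writes today, but the hostname machine00, the use of scp,
and the helper names are assumptions you would adapt to your setup:

```shell
#!/bin/sh
# Sketch of steps 1-2 for a reinstalled peer; run before starting glusterd.

GLUSTERD_DIR=${GLUSTERD_DIR:-/var/lib/glusterd}

# Step 1: recreate glusterd.info with the peer's old UUID and the
# cluster's operating-version (both read off a surviving peer).
write_glusterd_info() {
    uuid=$1; opver=$2
    mkdir -p "$GLUSTERD_DIR"
    printf 'UUID=%s\noperating-version=%s\n' "$uuid" "$opver" \
        > "$GLUSTERD_DIR/glusterd.info"
}

# Step 2: copy the peers directory from a healthy node, then drop the
# peerinfo file describing this peer itself (named after its own UUID).
fetch_peers() {
    good_peer=$1; my_uuid=$2
    scp -r "root@$good_peer:/var/lib/glusterd/peers" "$GLUSTERD_DIR/"
    rm -f "$GLUSTERD_DIR/peers/$my_uuid"
}

# Usage (values are placeholders to fill in from a surviving node):
#   write_glusterd_info <old-uuid> <operating-version>
#   fetch_peers machine00 <old-uuid>
#   systemctl start glusterd    # step 3
```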
>
> Why can't probe figure out that you were already part of a cluster, and
> when it runs, notice that your brains have been wiped, and just grab that
> info from the cluster and bring the node back up? It could even run heal
> on the data to ensure that nothing messed with it and that it matches the
> other replica.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users