I have a simple setup:

gluster> volume info

Volume Name: myvolume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.2.218.188:/srv
Brick2: 10.116.245.136:/srv
Brick3: 10.206.38.103:/srv
Brick4: 10.114.41.53:/srv
Brick5: 10.68.73.41:/srv
Brick6: 10.204.129.91:/srv

I *killed* Brick #4 (kill -9 and then shut down the instance). My intention is to simulate a catastrophic failure of Brick4 and replace it with a new server.

I probed the new server, then ran the following command:

gluster> peer probe 10.76.242.97
Probe successful

gluster> volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv start
replace-brick started successfully

I waited a little while, saw no traffic on the new server, and then ran this:

gluster> volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv status

It never returned. Now my cluster is in some weird state. It's still serving files, and I still have a job copying files to it, but I am unable to replace the bad peer with a new one.

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv status
replace-brick status unknown

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv abort
replace-brick abort failed

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv start
replace-brick failed to start

How can I get my cluster back into a clean working state?

Thanks,
Bryan
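(A note for anyone hitting the same hang: before retrying the replace-brick, it is worth checking what state glusterd thinks the cluster is in. This is only a rough sketch; the glusterd log path below is the usual 3.x default and may differ depending on how glusterfs was packaged.)

# run on the node driving the gluster CLI
gluster peer status                                # is 10.114.41.53 reported as disconnected?
gluster volume info myvolume                       # does Brick4 still point at the dead server?

# glusterd's own log is usually more informative than cli.log
tail -n 50 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log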
I don't know if it will help, but I see the following in cli.log when I run replace-brick status/start:

[2011-09-16 20:54:42.535212] W [rpc-transport.c:605:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2011-09-16 20:54:43.880179] I [cli-rpc-ops.c:1188:gf_cli3_1_replace_brick_cbk] 0-cli: Received resp to replace brick
[2011-09-16 20:54:43.880290] I [input.c:46:cli_batch] 0-: Exiting with: 1

On Fri, Sep 16, 2011 at 3:06 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
> How can I get my cluster back into a clean working state?
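(Since replace-brick is a cluster-wide operation, the failure may be recorded on a different peer than the one running the CLI. Below is a hypothetical way to gather the relevant glusterd log lines from every peer; it assumes root ssh access between the nodes and the default 3.x log path.)

for peer in 10.2.218.188 10.116.245.136 10.206.38.103 10.68.73.41 10.204.129.91 10.76.242.97; do
    echo "== $peer =="
    ssh root@$peer "grep -i replace /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -n 5"
done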
Bryan,

Replace-brick is not what you need here. Replace-brick is designed for the planned decommissioning of a node and the migration of data from it to a new node. You can only use replace-brick when both the source and the target servers are up and running.

If you have a catastrophic failure where one of the servers gets its OS disk completely wiped, you need to do this:

http://europe.gluster.org/community/documentation/index.php/Gluster_3.2:_Brick_Restoration_-_Replace_Crashed_Server

--
Vikas Gorur
Engineer - Gluster
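(In case that link goes stale, the procedure it describes boils down to roughly the following. This is only a sketch, assuming Gluster 3.2, where glusterd keeps its state under /etc/glusterd, and it assumes the replacement server is brought up with the same hostname/IP as the failed one. The init script name (glusterd vs. glusterfs-server) depends on the distribution packaging, "<uuid-found-above>" is a placeholder, and /mnt/myvolume stands in for wherever the volume is mounted on a client.)

# on any surviving peer: find the UUID glusterd had assigned to the dead server
grep -ir 10.114.41.53 /etc/glusterd/peers/        # the name of the matching file is the UUID

# on the replacement server (same hostname/IP as the failed one):
/etc/init.d/glusterd stop
echo "UUID=<uuid-found-above>" > /etc/glusterd/glusterd.info
mkdir -p /srv                                     # recreate the empty brick directory
/etc/init.d/glusterd start                        # peers and volume config sync from the cluster

# from a client mount point, walk the tree to trigger self-heal onto the new brick
find /mnt/myvolume -print0 | xargs -0 stat > /dev/null

As far as I remember, 3.2 has no "volume heal" command yet, which is why walking the mount with stat is what forces the replicate translator to copy files onto the freshly created brick.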