I have a simple setup:

gluster> volume info

Volume Name: myvolume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.2.218.188:/srv
Brick2: 10.116.245.136:/srv
Brick3: 10.206.38.103:/srv
Brick4: 10.114.41.53:/srv
Brick5: 10.68.73.41:/srv
Brick6: 10.204.129.91:/srv

I *killed* Brick #4 (kill -9 and then shut down the instance). My intention is to simulate a catastrophic failure of Brick4 and replace it with a new server.

I probed the new server, then ran the following command:

gluster> peer probe 10.76.242.97
Probe successful

gluster> volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv start
replace-brick started successfully

I waited a little while, saw no traffic on the new server, and then ran this:

gluster> volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv status

It never returned. Now my cluster is in some weird state. It's still serving files, and I still have a job copying files to it, but I am unable to replace the bad peer with a new one.

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv status
replace-brick status unknown

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv abort
replace-brick abort failed

root@ip-10-2-218-188:~# gluster volume replace-brick myvolume 10.114.41.53:/srv 10.76.242.97:/srv start
replace-brick failed to start

How can I get my cluster back into a clean working state?

Thanks,
Bryan
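(A note for anyone hitting the same hang: before retrying the replace-brick, it is worth checking what state glusterd thinks the cluster is in. This is only a rough sketch; the glusterd log path below is the usual 3.x default and may differ depending on how glusterfs was packaged.)

# run on the node driving the gluster CLI
gluster peer status                                # is 10.114.41.53 reported as disconnected?
gluster volume info myvolume                       # does Brick4 still point at the dead server?

# glusterd's own log is usually more informative than cli.log
tail -n 50 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log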
I don't know if it will help, but I see the following in cli.log when I run replace-brick status/start:

[2011-09-16 20:54:42.535212] W [rpc-transport.c:605:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2011-09-16 20:54:43.880179] I [cli-rpc-ops.c:1188:gf_cli3_1_replace_brick_cbk] 0-cli: Received resp to replace brick
[2011-09-16 20:54:43.880290] I [input.c:46:cli_batch] 0-: Exiting with: 1

On Fri, Sep 16, 2011 at 3:06 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
> How can I get my cluster back into a clean working state?
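(Since replace-brick is a cluster-wide operation, the failure may be recorded on a different peer than the one running the CLI. Below is a hypothetical way to gather the relevant glusterd log lines from every peer; it assumes root ssh access between the nodes and the default 3.x log path.)

for peer in 10.2.218.188 10.116.245.136 10.206.38.103 10.68.73.41 10.204.129.91 10.76.242.97; do
    echo "== $peer =="
    ssh root@$peer "grep -i replace /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -n 5"
done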
Bryan,

Replace-brick is not what you need here. Replace-brick is designed for the planned decommissioning of a node and the migration of data from it to a new node. You can only use replace-brick when both the source and the target servers are up and running.

If you have a catastrophic failure where one of the servers gets its OS disk completely wiped, you need to do this:

http://europe.gluster.org/community/documentation/index.php/Gluster_3.2:_Brick_Restoration_-_Replace_Crashed_Server

--
Vikas Gorur
Engineer - Gluster
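(In case that link goes stale, the procedure it describes boils down to roughly the following. This is only a sketch, assuming Gluster 3.2, where glusterd keeps its state under /etc/glusterd, and it assumes the replacement server is brought up with the same hostname/IP as the failed one. The init script name (glusterd vs. glusterfs-server) depends on the distribution packaging, "<uuid-found-above>" is a placeholder, and /mnt/myvolume stands in for wherever the volume is mounted on a client.)

# on any surviving peer: find the UUID glusterd had assigned to the dead server
grep -ir 10.114.41.53 /etc/glusterd/peers/        # the name of the matching file is the UUID

# on the replacement server (same hostname/IP as the failed one):
/etc/init.d/glusterd stop
echo "UUID=<uuid-found-above>" > /etc/glusterd/glusterd.info
mkdir -p /srv                                     # recreate the empty brick directory
/etc/init.d/glusterd start                        # peers and volume config sync from the cluster

# from a client mount point, walk the tree to trigger self-heal onto the new brick
find /mnt/myvolume -print0 | xargs -0 stat > /dev/null

As far as I remember, 3.2 has no "volume heal" command yet, which is why walking the mount with stat is what forces the replicate translator to copy files onto the freshly created brick.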