elvinas.piliponis at barclays.com
2013-Jun-18 06:13 UTC
[Gluster-users] Unable to remove / replace faulty bricks
Hello,

While trying to recover from a failed node and replace its brick with a spare one, I have trashed my cluster and it is now stuck. Any ideas how to reintroduce/remove those nodes and bring peace and order back to the cluster?

There was a pending brick replacement operation from 0031 to 0028 (it is still not committed according to the rbstate file).

There was a hardware failure on node 0022.

I was not able to commit the replace-brick from 0031, because 0022 was not responding and would not grant the cluster lock to the requesting node.

I was not able to start the replacement of 0022 with 0028 because of the pending brick replacement.

I forced removal of the peer from the cluster, hoping that afterwards I would be able to complete the operations. Unfortunately I removed not only 0022 but also 0031.

I have since peer probed 0031 successfully, and both gluster volume info and volume status now list node 0031. But when I attempt a brick operation I get:

gluster volume remove-brick glustervmstore 0031:/mnt/vmstore/brick 0036:/mnt/vmstore/brick force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Incorrect brick 0031:/mnt/vmstore/brick for volume glustervmstore

gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick 0028:/mnt/vmstore/brick commit force
brick: 0031:/mnt/vmstore/brick does not exist in volume: glustervmstore

The same applies to 0022, which is listed in volume info but not in volume status.

A full volume stop and start has not been attempted, as there are terabytes of data stored and in active use (Gluster is used as VM storage).

Thank you

Elvinas Piliponis | Virtualisation Engineering Engineer | Global Technology Infrastructure and Services
Tel +370 5 251 1218, 7 2249 1218 | Mobile +370 656 69249 | Email elvinas.piliponis at barclays.com
Barclays, GreenHall, Upes g. 21, Vilnius, Lithuania LT-081218
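For reference, a minimal sketch of how the stuck state can be inspected before forcing anything further, assuming the stock 3.3 CLI and the brick names used above (the replace-brick status/abort sub-commands may themselves fail while a peer is unreachable):

    # Check which peers glusterd currently knows about and their state.
    gluster peer status

    # Compare the brick lists reported by volume info and volume status.
    gluster volume info glustervmstore
    gluster volume status glustervmstore

    # Inspect the pending replace-brick and, if the cluster lock can be taken, abort it cleanly.
    gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick 0028:/mnt/vmstore/brick status
    gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick 0028:/mnt/vmstore/brick abort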
Vijay Bellur
2013-Jun-18 11:32 UTC
[Gluster-users] Unable to remove / replace faulty bricks
On 06/18/2013 11:43 AM, elvinas.piliponis at barclays.com wrote:
> Hello,
>
> When trying to recover from failed node and replace brick with spare one
> I have trashed my cluster and now it is in stuck state.
>
> Any ideas, how to reintroduce/remove those nodes and bring peace and
> order to cluster?
>
> There was a pending brick replacement operation from 0031 to 0028 (it is
> still not commited according to rbstate file)
>
> There was a hardware failure on 0022 node
>
> I was not able to commit replace brick 0031 due to 0022 was not
> responding and not giving cluster lock to requesting node.
>
> I was not able to start replacement 0022 to 0028 due to pending brick
> replacement
>
> I have forced peer removal from cluster, hoping that afterwards I would
> be able to complete operations. Unfortunately I have removes not only
> 0022 but 0031 also.
>
> I have peer probed 0031 successfully. But now gluster volume info and
> volume status both lists 0031 node. But when I attempt to do a brick
> operation I do get:
>
> gluster volume remove-brick glustervmstore 0031:/mnt/vmstore/brick
> 0036:/mnt/vmstore/brick force
>
> Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
>
> Incorrect brick 0031:/mnt/vmstore/brick for volume glustervmstore
>
> gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick
> 0028:/mnt/vmstore/brick commit force
>
> brick: 0031:/mnt/vmstore/brick does not exist in volume: glustervmstore

Looks like these commands are being rejected from a node where the volume information is not current. Can you please provide glusterd logs from the node where these commands were issued?

Thanks,
Vijay
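For reference, a minimal sketch of how those logs could be gathered on the node where the failing commands were run, assuming the default log path (the same file quoted from later in this thread):

    # glusterd's own log on the node that issued the commands
    less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

    # Package up the glusterd log (including rotated copies) for sharing.
    tar czf glusterd-logs-$(hostname).tar.gz /var/log/glusterfs/etc-glusterfs-glusterd.vol.log*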
Anand Avati
2013-Jun-19 19:16 UTC
[Gluster-users] Fwd: Unable to remove / replace faulty bricks
> Hello,
>
> I have managed to clear the pending 0031 to 0028 operation by shutting down
> all the nodes, deleting the rb_mount file and editing the rb_state file.
> However, this did not help to reintroduce 00031 to the cluster (0022 as well,
> but it is offline, so there is no chance to peer probe it).
>
> I have tried to replicate node removal and reattaching on another cluster,
> and the node did seem to be accepted after a peer probe, but since no spare
> servers are available for that cluster I was not able to do a "brick replace".
>
> In the gluster config files I do not find anything that might indicate
> that the node is not part of the cluster:
> * Node is part of glustervmstore-client-24
> * Subvolume is defined in replica set glustervmstore-replicate-12
> * Replica set is defined as part of the main volume.
> Everything looks like the other replica sets.
>
> *** COMMAND:
> gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick 00028:/mnt/vmstore/brick start
> brick: 00031:/mnt/vmstore/brick does not exist in volume: glustervmstore
>
> *** Log file /var/log/glusterfs/etc-glusterfs-glusterd.vol.log extracts:
>
> On the missing node 00031:
>
> [2013-06-18 12:45:09.328647] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
> [2013-06-18 12:45:11.983650] I [glusterd-handler.c:502:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.983723] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.983793] I [glusterd-handler.c:1322:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
> [2013-06-18 12:45:11.991438] I [glusterd-handler.c:1366:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.991537] I [glusterd-handler.c:1342:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
> [2013-06-18 12:45:12.329047] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
> [2013-06-18 12:45:15.329431] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> On the node where I am attempting to do the brick replace 00031 to 00028:
>
> [2013-06-18 12:45:11.982606] I [glusterd-replace-brick.c:98:glusterd_handle_replace_brick] 0-glusterd: Received replace brick req
> [2013-06-18 12:45:11.982691] I [glusterd-replace-brick.c:147:glusterd_handle_replace_brick] 0-glusterd: Received replace brick start request
> [2013-06-18 12:45:11.982754] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.982777] I [glusterd-handler.c:463:glusterd_op_txn_begin] 0-management: Acquired local lock
> [2013-06-18 12:45:11.984772] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: f7860586-f92c-4114-8336-823c223f18c0
> ..... LOTS of ACC messages .....
> [2013-06-18 12:45:11.987076] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: c49cfdbe-2af1-4050-bda1-bdd5fd3926b6
> [2013-06-18 12:45:11.987116] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 7e9e1cf3-214e-45c8-aa37-4da0def7fb6b
> [2013-06-18 12:45:11.987196] I [glusterd-utils.c:857:glusterd_volume_brickinfo_get_by_brick] 0-: brick: 00031:/mnt/vmstore/brick
> [2013-06-18 12:45:11.990732] E [glusterd-op-sm.c:1999:glusterd_op_ac_send_stage_op] 0-: Staging failed
> [2013-06-18 12:45:11.990785] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 0 peers
> [2013-06-18 12:45:11.992356] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: f0fcb6dd-c4ef-4751-b92e-db27ffd252d4
> [2013-06-18 12:45:11.992480] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 33c008a5-9c11-44d7-95c6-58362211bbe8
> ..... LOTS of ACC messages .....
> [2013-06-18 12:45:11.994447] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 444a54c6-d4f5-4407-905c-aef4e56e02be
> [2013-06-18 12:45:11.994483] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: c49cfdbe-2af1-4050-bda1-bdd5fd3926b6
> [2013-06-18 12:45:11.994527] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 7e9e1cf3-214e-45c8-aa37-4da0def7fb6b
> [2013-06-18 12:45:11.994555] I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
> [2013-06-18 12:45:12.270020] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> My attempt to manually delete the affected replica sets from
>   /var/lib/glusterd/vols/glustervmstore
>     glustervmstore-fuse.vol
>     info
>     trusted-glustervmstore-fuse.vol
>   /var/lib/glusterd/glustershd
>     glustershd-server.vol
> failed completely, as the glusterfs service then refused to start at all,
> complaining about unknown keys.

All volfiles are auto-generated from the information in the other files under /var/lib/glusterd/vols/<name>/ (such as ./info and ./bricks/*). So to manually fix your "situation", please make sure the contents of ./info, ./node_state.info, ./rbstate and ./bricks/* are "proper" (you can either share them with me offline, or compare them with another volume which is healthy), and issue a "gluster volume reset <volname>" to re-write fresh volfiles. It is also a good idea to double check that the contents of /var/lib/glusterd/peers/* are proper too. Doing these manual steps and restarting all processes should recover you from pretty much any situation.

Back to the cause of the problem: it appears that the ongoing replace-brick got messed up when yet another server died. A different way of achieving what you want is to use add-brick + remove-brick for decommissioning servers (i.e. add-brick the new server, 00028; "remove-brick start" the old one, 00031; and "remove-brick commit" once all the data has drained out). Moving forward, this will be the recommended way to decommission servers. Use replace-brick only to replace an already dead server (00022) with its replacement.

Let us know whether the above steps took you back to a healthy state or if you faced further issues.
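To make the two suggestions above concrete, a minimal sketch, assuming the brick names used earlier in this thread and the stock 3.3 CLI (on a replicated volume add-brick/remove-brick operate on whole replica sets, so the exact brick arguments may need adjusting):

    # Sanity-check the volume metadata on the affected node before regenerating volfiles.
    cat /var/lib/glusterd/vols/glustervmstore/info
    cat /var/lib/glusterd/vols/glustervmstore/rbstate
    ls /var/lib/glusterd/vols/glustervmstore/bricks/
    ls /var/lib/glusterd/peers/

    # Decommission by draining data instead of using replace-brick
    # (brick paths are the ones from this thread, shown for illustration only).
    gluster volume add-brick glustervmstore 00028:/mnt/vmstore/brick
    gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick start
    gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick status
    gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick commit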
Avati

> I am using the Semiosis 3.3.1 package on Ubuntu 12.04:
>
> dpkg -l | grep gluster
> rc  glusterfs         3.3.0-1                 clustered file-system
> ii  glusterfs-client  3.3.1-ubuntu1~precise8  clustered file-system (client package)
> ii  glusterfs-common  3.3.1-ubuntu1~precise8  GlusterFS common libraries and translator modules
> ii  glusterfs-server  3.3.1-ubuntu1~precise8  clustered file-system (server package)
>
> Thank you