Hans Lambermont
2013-Apr-08 14:17 UTC
[Gluster-users] Replace-brick on 3.3.1 hangs entire volume for several minutes and then hangs glusterfs on destination brick
Hi gluster users,

I just upgraded from 3.2.5 to 3.3.1 on a Distributed-Replicate volume with about 2M directories in order to get a working replace-brick. Instead, replace-brick now hangs the entire gluster volume for all clients for several minutes, and subsequently the glusterfs process on the destination brick hangs as well.

I suspect the volume-wide hang is related to https://bugzilla.redhat.com/show_bug.cgi?id=832609 "Glusterfsd hangs if brick filesystem becomes unresponsive, causing all clients to lock up".

The hung replace-brick destination process sits at 100% CPU and shows no strace output :

  gluster volume replace-brick xxx status
  Number of files migrated = 3        Current file= /xxx

  %CPU %MEM    TIME+ P COMMAND
   100  0.2  2238:48 2 //sbin/glusterfs -f/var/lib/glusterd/vols/vol01/rb_dst_brick.vol ...

The target brick received about 1% of the intended directories.

The log file -etc-glusterfs-glusterd.vol.log shows only that the replace-brick has started :

  I [glusterd-replace-brick.c:98:glusterd_handle_replace_brick] 0-glusterd: Received replace brick req
  I [glusterd-replace-brick.c:147:glusterd_handle_replace_brick] 0-glusterd: Received replace brick status request
  I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by 3*
  I [glusterd-handler.c:463:glusterd_op_txn_begin] 0-management: Acquired local lock
  I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 9*
  I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: c*
  I [glusterd-utils.c:857:glusterd_volume_brickinfo_get_by_brick] 0-: brick: s1:/g/c
  I [glusterd-utils.c:814:glusterd_volume_brickinfo_get] 0-management: Found brick
  I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 2 peers
  I [glusterd-rpc-ops.c:881:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: c*
  I [glusterd-rpc-ops.c:881:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 9*
  I [glusterd-utils.c:857:glusterd_volume_brickinfo_get_by_brick] 0-: brick: s1:/g/c
  I [glusterd-utils.c:814:glusterd_volume_brickinfo_get] 0-management: Found brick
  I [glusterd-replace-brick.c:1288:rb_update_dstbrick_port] 0-: adding dst-brick port no
  I [glusterd-op-sm.c:2384:glusterd_op_ac_send_commit_op] 0-management: Sent op req to 2 peers
  I [glusterd-rpc-ops.c:1317:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c*
  I [glusterd-rpc-ops.c:1317:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 9*
  I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 9*
  I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: c*
  I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock

Any hints on how to proceed from here and get replace-brick to work are welcome.

regards,
Hans Lambermont

--
Hans Lambermont | Senior Architect
(t) +31407370104
(w) www.shapeways.com
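
For reference, the checks described above amount to something like the following (a sketch assuming the stock 3.3.x CLI and standard tools; <VOL>, <src-brick> and <dst-brick> stand in for the real names, and the pgrep pattern is only an illustration):

  # progress counter shown above ("Number of files migrated" / "Current file")
  gluster volume replace-brick <VOL> <src-brick> <dst-brick> status

  # the destination-brick glusterfs stays pinned at 100% CPU ...
  top -b -n 1 -p $(pgrep -f rb_dst_brick.vol)

  # ... and produces no strace output while it is hung
  strace -f -p $(pgrep -f rb_dst_brick.vol)
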
Hans Lambermont
2013-Apr-09 12:31 UTC
[Gluster-users] 'replace-brick' - why we plan to deprecate
Amar Tumballi wrote on Thu Oct 11 18:35:32 UTC 2012 :

> When we initially came up with specs of 'glusterd', we needed an
> option to replace a dead brick, and few people even requested for
> having an option to migrate the data from the brick, when we are
> replacing it.

Do you specifically mean a 'dead' brick ? Or does your proposal hold for a live brick too ? (And doesn't a dead brick prevent one from reading data from it ?)

> The result of this is 'gluster volume replace-brick' CLI, and in the
> releases till 3.3.0 this was the only way to 'migrate' data off a
> removed brick properly.
>
> Now, with 3.3.0+ (ie, in upstream too), we have another *better*
> approach (technically), which is achieved by below methods:

...

> 2) (Distributed-)Replicate Volume:
>
> earlier:
> #gluster volume replace-brick <VOL> brick1 brick2 start [1]
>
> now:
>
> #gluster volume replace-brick <VOL> brick1 brick2 commit force
> (self-heal daemon takes care of syncing data from one brick to another)

For a dead brick this is OK. For a live brick, however, this would break redundancy during the long syncing time, which is unacceptable.

What is the status and roadmap for live-brick replace-brick in 3.3.1 ?

regards,
Hans Lambermont

--
Hans Lambermont | Senior Architect
(t) +31407370104
(w) www.shapeways.com
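
For comparison, the complete "new" sequence for a replicate volume would look roughly like this (a sketch assuming the 3.3.x CLI; <VOL>, <old-brick> and <new-brick> are placeholders, and the heal commands are the usual way to trigger and watch the resync onto the new brick):

  # swap the brick in place; no data is copied by this command itself
  gluster volume replace-brick <VOL> <old-brick> <new-brick> commit force

  # ask the self-heal daemon to rebuild the new brick, then watch progress
  gluster volume heal <VOL> full
  gluster volume heal <VOL> info

Until the heal finishes, that replica pair is effectively running on a single good copy; that window is the redundancy concern raised above for replacing a live brick.
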