Anup Nair
2013-Sep-03 07:48 UTC
[Gluster-users] How to remove a dead node and re-balance volume?
Glusterfs version: 3.2.2

I have a Gluster volume in which one out of the 4 peers/nodes crashed some time ago, before I joined here. I see from volume info that the crashed (non-existent) node is still listed as one of the peers and its bricks are also listed. I would like to detach this node and its bricks and rebalance the volume across the remaining 3 peers, but I am unable to do so. Here are my steps:

1. # gluster peer status
   Number of Peers: 3   (note: excluding the one I run this command from)

   Hostname: dbstore4r294   (note: the node/peer that is down)
   Uuid: 8bf13458-1222-452c-81d3-565a563d768a
   State: Peer in Cluster (Disconnected)

   Hostname: 172.16.1.90
   Uuid: 77ebd7e4-7960-4442-a4a4-00c5b99a61b4
   State: Peer in Cluster (Connected)

   Hostname: dbstore3r294
   Uuid: 23d7a18c-fe57-47a0-afbc-1e1a5305c0eb
   State: Peer in Cluster (Connected)

2. # gluster peer detach dbstore4r294
   Brick(s) with the peer dbstore4r294 exist in cluster

3. # gluster volume info

   Volume Name: test-volume
   Type: Distributed-Replicate
   Status: Started
   Number of Bricks: 4 x 2 = 8
   Transport-type: tcp
   Bricks:
   Brick1: dbstore1r293:/datastore1
   Brick2: dbstore2r293:/datastore1
   Brick3: dbstore3r294:/datastore1
   Brick4: dbstore4r294:/datastore1
   Brick5: dbstore1r293:/datastore2
   Brick6: dbstore2r293:/datastore2
   Brick7: dbstore3r294:/datastore2
   Brick8: dbstore4r294:/datastore2
   Options Reconfigured:
   network.ping-timeout: 42s
   performance.cache-size: 64MB
   performance.write-behind-window-size: 3MB
   performance.io-thread-count: 8
   performance.cache-refresh-timeout: 2

   Note that the non-existent node/peer is dbstore4r294 (its bricks are /datastore1 and /datastore2, i.e. Brick4 and Brick8).

4. # gluster volume remove-brick test-volume dbstore4r294:/datastore1
   Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
   Remove brick incorrect brick count of 1 for replica 2

5. # gluster volume remove-brick test-volume dbstore4r294:/datastore1 dbstore4r294:/datastore2
   Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
   Bricks not from same subvol for replica

How do I remove the peer? What are the steps, considering that the node is non-existent?

Regards,
Anup Nair
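A note on the two errors above: in a distributed-replicate volume, bricks form replica sets in the order they are listed, so the subvolumes here are (Brick1, Brick2), (Brick3, Brick4), (Brick5, Brick6) and (Brick7, Brick8). Step 4 fails because bricks must be removed in multiples of the replica count (2), and step 5 fails because Brick4 and Brick8 belong to two different replica subvolumes. A syntactically valid remove-brick would have to name a complete pair, for example (a sketch only; on 3.2.x remove-brick does not migrate data off the bricks first, so this discards that subvolume's contents, including the healthy copy on dbstore3r294):

# gluster volume remove-brick test-volume dbstore3r294:/datastore1 dbstore4r294:/datastore1

Since the data is presumably wanted, the replace-brick approach in the reply below is the safer route.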
Vijay Bellur
2013-Sep-04 19:11 UTC
[Gluster-users] How to remove a dead node and re-balance volume?
On 09/03/2013 01:18 PM, Anup Nair wrote:
> Glusterfs version 3.2.2
>
> I have a Gluster volume in which one out of the 4 peers/nodes crashed
> some time ago, prior to my joining service here.
[...]
> How do I remove the peer? What are the steps considering that the node
> is non-existent?

Do you plan to replace the dead server with a new server? If so, this could be a possible sequence of steps:

1. Peer probe the new server and have two bricks ready on it.
2. volume replace-brick <volname> <brick4> <new-brick1> commit force
3. volume replace-brick <volname> <brick8> <new-brick2> commit force
4. Peer detach the dead server.
5. Since 3.2.2 is being used here, you would need a crawl (find . | xargs stat) from a client mount point to trigger self-healing for the newly added bricks.

-Vijay
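Concretely, assuming a replacement server named dbstore5r294 (a hypothetical hostname; substitute the real one) that exposes the same /datastore1 and /datastore2 brick paths, Vijay's outline would translate to something like this (a sketch, not verified against a live 3.2.2 cluster):

# gluster peer probe dbstore5r294
# gluster volume replace-brick test-volume dbstore4r294:/datastore1 dbstore5r294:/datastore1 commit force
# gluster volume replace-brick test-volume dbstore4r294:/datastore2 dbstore5r294:/datastore2 commit force
# gluster peer detach dbstore4r294

Once the dead peer no longer holds any bricks, the detach should not be refused. To trigger self-heal onto the new bricks (3.2.x has no self-heal daemon), crawl the volume from a client mount point, e.g. the hypothetical /mnt/test-volume:

# cd /mnt/test-volume
# find . | xargs stat > /dev/null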