Greg Scott
2012-Jan-22 11:00 UTC
[Gluster-users] Need to replace a brick on a failed first Gluster node
Hello - I am using Glusterfs 3.2.5-2. I have one very small replicated volume with 2 bricks, as follows:

[root at lme-fw2 ~]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2

The application is a small active/standby HA appliance and I use the Gluster volume for config info. The Gluster nodes are also clients and there are no other clients. Fortunately for me, nothing is in production yet.

My challenge is, the hard drive at 192.168.253.1 failed. This was the first Gluster node when I set everything up. I replaced its hard drive and am rebuilding it. I have a good copy of everything I care about in the 192.168.253.2 brick. My thought was, I could just remove the old 192.168.253.1 brick and replica, then gluster peer and add it all back again. But apparently it is not so simple:

[root at lme-fw2 ~]# gluster volume remove-brick firewall-scripts 192.168.253.1:/gluster-fw1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Incorrect brick 192.168.253.1:/gluster-fw1 for volume firewall-scripts

Not particularly helpful diagnostic info. I also played around with gluster peer detach/attach, but now I think I may have created a mess:

[root at lme-fw2 ~]# gluster peer probe 192.168.253.1
^C
[root at lme-fw2 ~]# gluster peer status
Number of Peers: 1

Hostname: 192.168.253.1
Uuid: 00000000-0000-0000-0000-000000000000
State: Establishing Connection (Disconnected)
[root at lme-fw2 ~]#

Trying again:

[root at lme-fw2 ~]# gluster peer detach 192.168.253.1
Detach successful
[root at lme-fw2 ~]# gluster peer status
No peers present
[root at lme-fw2 ~]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[root at lme-fw2 ~]# gluster volume remove-brick firewall-scripts 192.168.253.1:/gluster-fw1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Incorrect brick 192.168.253.1:/gluster-fw1 for volume firewall-scripts
[root at lme-fw2 ~]#

This should be simple and maybe I am missing something. On the fw2 Gluster node, I want to remove all trace of the old fw1 and then set up a new fw1 as a new replica. How do I get there from here?

Also, once this goes into production, I will not have the luxury of taking everything offline and rebuilding it. What is the best way to recover from a hard drive failure on either node?

Thanks

- Greg Scott
Giovanni Toraldo
2012-Jan-22 11:34 UTC
[Gluster-users] Need to replace a brick on a failed first Gluster node
Hi Greg,

2012/1/22 Greg Scott <GregScott at infrasupport.com>:
> My challenge is, the hard drive at 192.168.253.1 failed. This was the first
> Gluster node when I set everything up. I replaced its hard drive and am
> rebuilding it. I have a good copy of everything I care about in the
> 192.168.253.2 brick. My thought was, I could just remove the old
> 192.168.253.1 brick and replica, then gluster peer and add it all back
> again.

It's far simpler than that: if you keep the same hostname/IP address on the new machine, you only need to make sure the new glusterd has the same UUID as the old dead one (it is stored in a file under /etc/glusterd). The configuration is automatically synced back at the first contact with the other active node.

If instead you replace the node with a different machine with a different hostname / IP:

http://community.gluster.org/q/a-replica-node-has-failed-completely-and-must-be-replaced-with-new-empty-hardware-how-do-i-add-the-new-hardware-and-bricks-back-into-the-replica-pair-and-begin-the-healing-process/

--
Giovanni Toraldo - LiberSoft
http://www.libersoft.it
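For the same-hostname/IP case Giovanni describes, a minimal sketch of the recovery on GlusterFS 3.2 might look like the following. It assumes the surviving node still has (or you otherwise know) the old peer's UUID, that glusterd's working directory is /etc/glusterd as in 3.2, and that the client mount point and the UUID placeholder are only examples, not values from this thread.

# On the surviving node (fw2), note the UUID it recorded for the dead peer:
gluster peer status
cat /etc/glusterd/peers/*

# On the rebuilt node (fw1), stop glusterd, write that UUID into glusterd.info,
# then start glusterd again:
service glusterd stop
echo "UUID=<uuid-noted-on-fw2>" > /etc/glusterd/glusterd.info
service glusterd start

# From fw2, re-probe the rebuilt node so the volume configuration syncs over:
gluster peer probe 192.168.253.1

# 3.2 has no "gluster volume heal" command, so trigger self-heal by walking the
# volume from a client mount (mount point shown is only an example):
find /mnt/firewall-scripts -noleaf -print0 | xargs --null stat >/dev/null

Once the probe succeeds, gluster volume info on the rebuilt node should show the same two bricks, and the stat walk should make the replicate translator copy the files back onto the empty fw1 brick.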