John Gardeniers
2014-Jun-24 22:59 UTC
[Gluster-users] Advise on recovering from a bad replica please
Hi All,

We're using Gluster as the storage for our virtualization. This consists of 2 servers with a single brick each, configured as a replica pair. We also have a geo-replica on one of those two servers.

For reasons that don't really matter, last weekend we had a situation which caused one server to reboot a number of times, which in turn resulted in a lot of heal-failed and split-brain errors. Because VMs were being migrated across hosts at the same time, we ended up with many crashed VMs.

Due to the need to get the VMs up and running as quickly as possible, we decided to shut down one Gluster replica and use the "primary" one alone. As the geo-replica is also on the node we shut down, that leaves us with just a single copy, which makes us rather nervous.

As we have decided to treat the files on the currently running node as "correct", I'd appreciate advice on the best way to get the other node back into the replication. Should we simply bring it back online and try to correct the errors, which I expect will be many, or should we treat it as a failed server and bring it back with an empty brick, rather than what is currently in the existing brick? The volume/bricks are 5TB, of which we're currently using around 2TB, and the servers are on a 10Gb network, so I imagine it shouldn't take too long to rebuild, and this would all be done out of hours anyway.

regards,
John
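[Editor's note] The rebuild-time estimate above can be checked with some quick arithmetic. The sketch below is only a lower bound under stated assumptions: it takes "2TB" as decimal terabytes, treats the 10Gb link as the only bottleneck, and the function name and the `efficiency` parameter are hypothetical, introduced for illustration. A real self-heal of many VM images is likely to be slower, since disk throughput and the heal process itself usually limit the rate before the network does.

```python
def min_transfer_seconds(data_bytes: float,
                         link_bits_per_second: float,
                         efficiency: float = 1.0) -> float:
    """Lower bound on the time to copy data_bytes over a link.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for the
    fraction of line rate actually achieved; it is an assumption, not a
    measured Gluster figure.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


used_bytes = 2e12   # ~2 TB currently in use (decimal terabytes assumed)
link_bps = 10e9     # 10 Gb/s network

best_case = min_transfer_seconds(used_bytes, link_bps)
half_rate = min_transfer_seconds(used_bytes, link_bps, efficiency=0.5)

print(f"at line rate:        {best_case / 60:.1f} min")
print(f"at 50% of line rate: {half_rate / 60:.1f} min")
```

Even at half of line rate, the copy itself comes out at under an hour, which is consistent with the expectation that a full rebuild fits in an out-of-hours window, provided the disks can keep up.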
Pranith Kumar Karampuri
2014-Jun-25 09:05 UTC
[Gluster-users] Advise on recovering from a bad replica please
On 06/25/2014 04:29 AM, John Gardeniers wrote:
> Hi All,
>
> We're using Gluster as the storage for our virtualization. This consists
> of 2 servers with a single brick each, configured as a replica pair. We
> also have a geo-replica on one of those two servers.
>
> For reasons that don't really matter, last weekend we had a situation
> which caused one server to reboot a number of times, which in turn
> resulted in a lot of heal-failed and split-brain errors. Because VMs
> were being migrated across hosts at the same time, we ended up with many
> crashed VMs.
>
> Due to the need to get the VMs up and running as quickly as possible,
> we decided to shut down one Gluster replica and use the "primary" one
> alone. As the geo-replica is also on the node we shut down, that leaves
> us with just a single copy, which makes us rather nervous.
>
> As we have decided to treat the files on the currently running node as
> "correct", I'd appreciate advice on the best way to get the other node
> back into the replication. Should we simply bring it back online and
> try to correct the errors, which I expect will be many, or should we treat
> it as a failed server and bring it back with an empty brick, rather than
> what is currently in the existing brick? The volume/bricks are 5TB, of
> which we're currently using around 2TB, and the servers are on a 10Gb
> network, so I imagine it shouldn't take too long to rebuild, and this
> would all be done out of hours anyway.

Since you say there were split-brain errors as well, I suggest you bring the node back with an empty brick. Could you send the output of "gluster volume info" and tell me which brick went down? Based on that, I will tell you what you need to do.

Pranith

> regards,
> John