Greg Scott
2014-Sep-16 18:41 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
This seems like such an innocent question. I have a firewall system controlling tunnels all over the USA. It's an HA setup with two nodes, and I use Gluster to keep all the configs and logs replicated.

It's an active/standby system and it's been in place for something like 3 years. The standby had a catastrophic hardware failure a while ago and it looks like it needs a new motherboard. We have people rebuilding the hardware. The standby hard drive seems fine.

But now the primary system repeatedly stalls its I/Os, sometimes to directories that aren't even part of Glusterfs, and the problem is getting worse day by day, hour by hour. Before they barbecue me, how do I tell Gluster to temporarily take the failed node offline while the motherboard is replaced, then put it back in service and copy everything over to it? I don't want to completely remove the brick, because when the hardware is repaired and we start it up again I want it to join back up and have everything replicate over to it.

So for now - what can I do on the surviving node to tell it not to try to replicate until further notice, and then how do I tell it to go back to normal when I get the standby system back online?

Thanks

- Greg Scott
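[Editor's note: one possible approach on the surviving node is sketched below. "gvol" is a placeholder volume name, and the option names should be verified against the installed release - self-heal tunables vary between Gluster versions, so this is a sketch, not a definitive procedure.]

```shell
# On the surviving node. "gvol" is a placeholder; substitute the real
# volume name shown by `gluster volume info`. Sketch for a two-node
# replicated volume with one dead peer.

# Confirm the failed peer shows as disconnected and nothing else is wrong:
gluster peer status
gluster volume info gvol

# Shorten how long clients block on the unreachable brick (the default
# network.ping-timeout is 42 seconds, which shows up as long I/O stalls):
gluster volume set gvol network.ping-timeout 10

# Optionally turn off client-side self-heal until the standby is repaired,
# so clients stop trying to reconcile against the missing brick:
gluster volume set gvol cluster.data-self-heal off
gluster volume set gvol cluster.metadata-self-heal off
gluster volume set gvol cluster.entry-self-heal off
```

Nothing here detaches the peer or removes the brick, so the repaired node can rejoin without reconfiguration.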
Greg Scott
2014-Sep-16 18:56 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
This seems like such an innocent question. I have a firewall system controlling tunnels all over the USA. It's an HA setup with two nodes, and I built it with Gluster 3.2 to keep all the configs and logs replicated.

It's an active/standby system and it's been in place for something like 3 years. The standby system had a catastrophic hardware failure a while ago and it looks like it needs a new motherboard. We have a new motherboard coming in a few days. The standby hard drive seems fine.

But now the primary system repeatedly stalls its I/Os, sometimes to directories that aren't even part of Glusterfs, and the problem is getting worse day by day, hour by hour. Before they barbecue me, how do I tell Gluster to temporarily take the failed node offline while its motherboard is replaced, then put it back in service and copy everything over to it after it's repaired? I would rather not completely remove the brick, because when the hardware is repaired and we start it up again I want it to join back up and have everything replicate over to it.

So for now - what can I do on the surviving node to tell it not to try to replicate until further notice, and then how do I tell it to go back to normal when I get the standby system back online?

Here is an example of the stalled disk I/Os. These take 1-2 minutes or more before giving me any output. Sometimes I/Os just hang and never return.
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 11, in <module>
    import tempfile
  File "/usr/lib64/python2.7/tempfile.py", line 18, in <module>
    """
KeyboardInterrupt
^C
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 15, in <module>
    import repce
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/repce.py", line 1, in <module>
    import os
KeyboardInterrupt
^C
[root@lme-fw2 ipsec.d]# echo $0
-bash
[root@lme-fw2 ipsec.d]# echo $?
0
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 17, in <module>
    from master import GMaster
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/master.py", line 1, in <module>
    import os
KeyboardInterrupt
volume info [all|<VOLNAME>] - list information of all volumes
volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... - create a new volume of specified type with mentioned bricks
volume delete <VOLNAME> - delete volume specified by <VOLNAME>
volume start <VOLNAME> [force] - start volume specified by <VOLNAME>
volume stop <VOLNAME> [force] - stop volume specified by <VOLNAME>
volume add-brick <VOLNAME> <NEW-BRICK> ... - add brick to volume <VOLNAME>
volume remove-brick <VOLNAME> <BRICK> ... - remove brick from volume <VOLNAME>
volume rebalance <VOLNAME> [fix-layout|migrate-data] {start|stop|status} - rebalance operations
volume replace-brick <VOLNAME> <BRICK> <NEW-BRICK> {start|pause|abort|status|commit} - replace-brick operations
volume set <VOLNAME> <KEY> <VALUE> - set options for volume <VOLNAME>
volume help - display help for the volume command
volume log filename <VOLNAME> [BRICK] <PATH> - set the log file for corresponding volume/brick
volume log locate <VOLNAME> [BRICK] - locate the log file for corresponding volume/brick
volume log rotate <VOLNAME> [BRICK] - rotate the log file for corresponding volume/brick
volume sync <HOSTNAME> [all|<VOLNAME>] - sync the volume information from a peer
volume reset <VOLNAME> [force] - reset all the reconfigured options
volume geo-replication [<VOLNAME>] [<SLAVE-URL>] {start|stop|config|status|log-rotate} [options...] - Geo-sync operations
volume profile <VOLNAME> {start|info|stop} - volume profile operations
volume quota <VOLNAME> <enable|disable|limit-usage|list|remove> [path] [value] - quota translator specific operations
volume top <VOLNAME> {[open|read|write|opendir|readdir] |[read-perf|write-perf bs <size> count <count>]} [brick <brick>] [list-cnt <count>] - volume top operations
peer probe <HOSTNAME> - probe peer specified by <HOSTNAME>
peer detach <HOSTNAME> - detach peer specified by <HOSTNAME>
peer status - list status of peers
peer help - Help command for peer
quit - quit
help - display command options
[root@lme-fw2 ipsec.d]#

Thanks

- Greg Scott
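[Editor's note: once the repaired standby is back up, resynchronization could look like the sketch below. Note that the 3.2-era help output above has no "volume heal" command - on that release a full self-heal is triggered from a client mount by walking the filesystem. "gvol" and the mount point are placeholders.]

```shell
# After the repaired standby is powered up and glusterd is running again.
# "gvol" and /mnt/gvol are placeholders for the real volume and mount point.

# The peer should reconnect on its own, since it was never detached:
gluster peer status

# Re-enable client-side self-heal if it was disabled while the node was down:
gluster volume set gvol cluster.data-self-heal on
gluster volume set gvol cluster.metadata-self-heal on
gluster volume set gvol cluster.entry-self-heal on

# On Gluster 3.2 there is no "volume heal" command; a full self-heal is
# triggered from a client mount by stat()ing every file, which makes the
# client replicate anything missing onto the returned brick:
find /mnt/gvol -noleaf -print0 | xargs --null stat >/dev/null
```

The find/stat walk can take a while on a large volume, but for a config-and-logs volume like this one it should be quick.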
Eliezer Croitoru
2014-Sep-17 14:25 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
Before you rush into posting and reposting an issue on the mailing list, first try to get into the glusterfs IRC channel at freenode.net and get live help. It will not help you to spam the list. You need to post the relevant information: glusterfs versions of server and client, OS version, df output, etc. I still do not understand the basics of your setup.

Eliezer

On 09/16/2014 09:41 PM, Greg Scott wrote:
> This seems like such an innocent question. I have a firewall system
> controlling tunnels all over the USA. It's an HA setup with two nodes.
> And I use Gluster to keep all the configs and logs replicated.
>
> It's an active/standby system and it's been in place for something like
> 3 years. The standby had a catastrophic hardware failure a while ago
> and it looks like it needs a new motherboard. We have people
> rebuilding the hardware. The standby hard drive seems fine.
>
> But now the primary system repeatedly stalls its I/Os, sometimes to
> directories that aren't even part of Glusterfs. And the problem is
> getting worse day by day, hour by hour. Before they barbecue me, how do
> I tell Gluster to temporarily take the failed node offline while the
> motherboard is replaced, then put it back in service and copy everything
> over to it? I don't want to completely remove the brick because when
> the hardware is repaired and we start it up again, I want it to join
> back up and have everything replicate over to it.
>
> So for now - what can I do on the surviving node to tell it not to try
> to replicate until further notice, and then how do I tell it to go back
> to normal when I get the standby system back online?
>
> Thanks
>
> - Greg Scott
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
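[Editor's note: the information requested above could be collected with something like the following. The redhat-release path is an assumption based on a typical Fedora/RHEL install; adjust for other distributions.]

```shell
# Gather the basics requested above: versions, OS, and disk state.
gluster --version | head -1                        # Gluster version on this node
cat /etc/redhat-release 2>/dev/null || uname -a    # OS version (fallback to kernel info)
df -h                                              # disk usage, including the gluster mount
gluster volume info                                # volume layout and configured options
gluster peer status                                # peer health as seen from this node
```

Run on both nodes where possible, since client and server versions can differ.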