Greg Scott
2014-Sep-16 18:41 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
This seems like such an innocent question. I have a firewall system controlling tunnels all over the USA. It's an HA setup with two nodes, and I use Gluster to keep all the configs and logs replicated.

It's an active/standby system and it's been in place for something like 3 years. The standby had a catastrophic hardware failure a while ago and it looks like it needs a new motherboard. We have people rebuilding the hardware. The standby hard drive seems fine.

But now the primary system repeatedly stalls its I/Os, sometimes to directories that aren't even part of Glusterfs, and the problem is getting worse day by day, hour by hour. Before they barbecue me, how do I tell Gluster to temporarily take the failed node offline while the motherboard is replaced, then put it back in service and copy everything over to it? I don't want to completely remove the brick, because when the hardware is repaired and we start it up again I want it to join back up and have everything replicate over to it.

So for now - what can I do on the surviving node to tell it not to try to replicate until further notice, and then how do I tell it to go back to normal when I get the standby system back online?

Thanks

- Greg Scott
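[Editor's note: one possible approach on the surviving node is sketched below. "gvol" is a placeholder volume name, and the option names should be verified against the installed release - self-heal tunables vary between Gluster versions, so this is a sketch, not a definitive procedure.]

```shell
# On the surviving node. "gvol" is a placeholder; substitute the real
# volume name shown by `gluster volume info`. Sketch for a two-node
# replicated volume with one dead peer.

# Confirm the failed peer shows as disconnected and nothing else is wrong:
gluster peer status
gluster volume info gvol

# Shorten how long clients block on the unreachable brick (the default
# network.ping-timeout is 42 seconds, which shows up as long I/O stalls):
gluster volume set gvol network.ping-timeout 10

# Optionally turn off client-side self-heal until the standby is repaired,
# so clients stop trying to reconcile against the missing brick:
gluster volume set gvol cluster.data-self-heal off
gluster volume set gvol cluster.metadata-self-heal off
gluster volume set gvol cluster.entry-self-heal off
```

Nothing here detaches the peer or removes the brick, so the repaired node can rejoin without reconfiguration.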
Greg Scott
2014-Sep-16 18:56 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
This seems like such an innocent question. I have a firewall system controlling tunnels all over the USA. It's an HA setup with two nodes, and I built it with Gluster 3.2 to keep all the configs and logs replicated.

It's an active/standby system and it's been in place for something like 3 years. The standby system had a catastrophic hardware failure a while ago and it looks like it needs a new motherboard. We have a new motherboard coming in a few days. The standby hard drive seems fine.

But now the primary system repeatedly stalls its I/Os, sometimes to directories that aren't even part of Glusterfs, and the problem is getting worse day by day, hour by hour. Before they barbecue me, how do I tell Gluster to temporarily take the failed node offline while its motherboard is replaced, then put it back in service and copy everything over to it after it's repaired? I would rather not completely remove the brick, because when the hardware is repaired and we start it up again I want it to join back up and have everything replicate over to it.

So for now - what can I do on the surviving node to tell it not to try to replicate until further notice, and then how do I tell it to go back to normal when I get the standby system back online?

Here is an example of the stalled disk I/Os. These take 1-2 minutes or more before giving me any output. Sometimes I/Os just hang and never return.
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 11, in <module>
    import tempfile
  File "/usr/lib64/python2.7/tempfile.py", line 18, in <module>
    """
KeyboardInterrupt
^C
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]#
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 15, in <module>
    import repce
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/repce.py", line 1, in <module>
    import os
KeyboardInterrupt
^C
[root@lme-fw2 ipsec.d]# echo $0
-bash
[root@lme-fw2 ipsec.d]# echo $?
0
[root@lme-fw2 ipsec.d]# gluster help
Traceback (most recent call last):
  File "/opt/glusterfs/3.2.5/local/libexec//glusterfs/python/syncdaemon/gsyncd.py", line 19, in <module>
    import resource
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/resource.py", line 17, in <module>
    from master import GMaster
  File "/opt/glusterfs/3.2.5/local/libexec/glusterfs/python/syncdaemon/master.py", line 1, in <module>
    import os
KeyboardInterrupt
volume info [all|<VOLNAME>] - list information of all volumes
volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... - create a new volume of specified type with mentioned bricks
volume delete <VOLNAME> - delete volume specified by <VOLNAME>
volume start <VOLNAME> [force] - start volume specified by <VOLNAME>
volume stop <VOLNAME> [force] - stop volume specified by <VOLNAME>
volume add-brick <VOLNAME> <NEW-BRICK> ... - add brick to volume <VOLNAME>
volume remove-brick <VOLNAME> <BRICK> ... - remove brick from volume <VOLNAME>
volume rebalance <VOLNAME> [fix-layout|migrate-data] {start|stop|status} - rebalance operations
volume replace-brick <VOLNAME> <BRICK> <NEW-BRICK> {start|pause|abort|status|commit} - replace-brick operations
volume set <VOLNAME> <KEY> <VALUE> - set options for volume <VOLNAME>
volume help - display help for the volume command
volume log filename <VOLNAME> [BRICK] <PATH> - set the log file for corresponding volume/brick
volume log locate <VOLNAME> [BRICK] - locate the log file for corresponding volume/brick
volume log rotate <VOLNAME> [BRICK] - rotate the log file for corresponding volume/brick
volume sync <HOSTNAME> [all|<VOLNAME>] - sync the volume information from a peer
volume reset <VOLNAME> [force] - reset all the reconfigured options
volume geo-replication [<VOLNAME>] [<SLAVE-URL>] {start|stop|config|status|log-rotate} [options...] - Geo-sync operations
volume profile <VOLNAME> {start|info|stop} - volume profile operations
volume quota <VOLNAME> <enable|disable|limit-usage|list|remove> [path] [value] - quota translator specific operations
volume top <VOLNAME> {[open|read|write|opendir|readdir] |[read-perf|write-perf bs <size> count <count>]} [brick <brick>] [list-cnt <count>] - volume top operations
peer probe <HOSTNAME> - probe peer specified by <HOSTNAME>
peer detach <HOSTNAME> - detach peer specified by <HOSTNAME>
peer status - list status of peers
peer help - Help command for peer
quit - quit
help - display command options
[root@lme-fw2 ipsec.d]#

Thanks

- Greg Scott
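[Editor's note: once the repaired standby is back up, resynchronization could look like the sketch below. Note that the 3.2-era help output above has no "volume heal" command - on that release a full self-heal is triggered from a client mount by walking the filesystem. "gvol" and the mount point are placeholders.]

```shell
# After the repaired standby is powered up and glusterd is running again.
# "gvol" and /mnt/gvol are placeholders for the real volume and mount point.

# The peer should reconnect on its own, since it was never detached:
gluster peer status

# Re-enable client-side self-heal if it was disabled while the node was down:
gluster volume set gvol cluster.data-self-heal on
gluster volume set gvol cluster.metadata-self-heal on
gluster volume set gvol cluster.entry-self-heal on

# On Gluster 3.2 there is no "volume heal" command; a full self-heal is
# triggered from a client mount by stat()ing every file, which makes the
# client replicate anything missing onto the returned brick:
find /mnt/gvol -noleaf -print0 | xargs --null stat >/dev/null
```

The find/stat walk can take a while on a large volume, but for a config-and-logs volume like this one it should be quick.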
Eliezer Croitoru
2014-Sep-17 14:25 UTC
[Gluster-users] How do I temporarily take a brick out of service and then put it back later?
Before you rush into posting and reposting an issue on the mailing list, first try to get into the glusterfs IRC channel at freenode.net and get live help. It will not help you to spam the list. You need to post the relevant information: glusterfs versions of server and client, OS version, df output, etc. I still do not understand the basics of your setup.

Eliezer

On 09/16/2014 09:41 PM, Greg Scott wrote:
> This seems like such an innocent question. I have a firewall system
> controlling tunnels all over the USA. It's an HA setup with two nodes.
> And I use Gluster to keep all the configs and logs replicated.
>
> It's an active/standby system and it's been in place for something like
> 3 years. The standby had a catastrophic hardware failure a while ago
> and it looks like it needs a new motherboard. We have people
> rebuilding the hardware. The standby hard drive seems fine.
>
> But now the primary system repeatedly stalls its I/Os, sometimes to
> directories that aren't even part of Glusterfs. And the problem is
> getting worse day by day, hour by hour. Before they barbecue me, how do
> I tell Gluster to temporarily take the failed node offline while the
> motherboard is replaced, then put it back in service and copy everything
> over to it? I don't want to completely remove the brick because when
> the hardware is repaired and we start it up again, I want it to join
> back up and have everything replicate over to it.
>
> So for now - what can I do on the surviving node to tell it not to try
> to replicate until further notice, and then how do I tell it to go back
> to normal when I get the standby system back online?
>
> Thanks
>
> - Greg Scott
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
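[Editor's note: the information requested above could be collected with something like the following. The redhat-release path is an assumption based on a typical Fedora/RHEL install; adjust for other distributions.]

```shell
# Gather the basics requested above: versions, OS, and disk state.
gluster --version | head -1                        # Gluster version on this node
cat /etc/redhat-release 2>/dev/null || uname -a    # OS version (fallback to kernel info)
df -h                                              # disk usage, including the gluster mount
gluster volume info                                # volume layout and configured options
gluster peer status                                # peer health as seen from this node
```

Run on both nodes where possible, since client and server versions can differ.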