thr3ads.net - Gluster users - [Gluster-users] Replica 3 - how to replace failed node (peer) [Apr 2019]

Martin Toth

2019-Apr-11 13:08 UTC

[Gluster-users] Replica 3 - how to replace failed node (peer)

Hi Karthik,

> On Thu, Apr 11, 2019 at 12:43 PM Martin Toth <snowmailer at gmail.com> wrote:
>> Hi Karthik,
>>
>> moreover, I would like to ask whether there are any recommended
>> settings/parameters for SHD to achieve good or fair I/O while the volume
>> is being healed after I replace the brick (this should trigger the
>> healing process).
>
> If I understand your concern correctly, you need fair I/O performance for
> clients while healing takes place as part of the replace-brick operation.
> For this you can turn off the "data-self-heal" and "metadata-self-heal"
> options until the heal completes on the new brick.

This is exactly what I mean. I am running VM disks on the remaining 2 nodes
(out of 3, one failed as mentioned) and I need to ensure fair I/O performance
on these two nodes while the replace-brick operation heals the volume.
I will not run any VMs on the node where the replace-brick operation will be
running. So if I understand correctly, when I set:

# gluster volume set <volname> cluster.data-self-heal off
# gluster volume set <volname> cluster.metadata-self-heal off

this will tell Gluster clients (libgfapi and FUSE mount) not to read from the
node "where replace brick operation" is in place, but from the remaining two
healthy nodes. Is this correct? Thanks for the clarification.

> Turning off client-side healing doesn't compromise data integrity or
> consistency. During a read request from a client, the pending xattr is
> evaluated for the replica copies and the read is served only from a correct
> copy. During writes, I/O continues on both replicas, and SHD takes care of
> healing the files.
>
> After replacing the brick, we strongly recommend that you consider
> upgrading your gluster to one of the maintained versions. They contain many
> stability-related fixes that handle critical issues and corner cases you
> could hit in this kind of scenario.

This will be the first priority in our infrastructure after restoring this
cluster to a fully functional replica 3. I will upgrade to 3.12.x and then
to version 5 or 6.

BR,
Martin

> Regards,
> Karthik
>
>> I had some problems in the past when healing was triggered: VM disks
>> became unresponsive because healing took most of the I/O. My volume
>> contains only big files with VM disks.
>>
>> Thanks for suggestions.
>> BR,
>> Martin
>>
>> On 10 Apr 2019, at 12:38, Martin Toth <snowmailer at gmail.com> wrote:
>>
>> Thanks, this looks OK to me. I will reset the brick because I no longer
>> have any data on the failed node, so I can use the same path / brick name.
>>
>> Is resetting a brick a dangerous command? Should I be worried about a
>> possible failure that would impact the remaining two nodes? I am running
>> a really old, but stable, version 3.7.6.
>>
>> Thanks,
>> BR!
>>
>> Martin
>>
>>> On 10 Apr 2019, at 12:20, Karthik Subrahmanya <ksubrahm at redhat.com> wrote:
>>>
>>> Hi Martin,
>>>
>>> After you add the new disks and create the RAID array, you can run one
>>> of the following commands to replace the old brick with the new one:
>>>
>>> - If you are going to use a different name for the new brick, you can run
>>>   gluster volume replace-brick <volname> <old-brick> <new-brick> commit force
>>>
>>> - If you are planning to use the same name for the new brick, you can use
>>>   gluster volume reset-brick <volname> <old-brick> <new-brick> commit force
>>>   Here the hostname and path of old-brick and new-brick should be the same.
>>>
>>> After replacing the brick, make sure the brick comes online using
>>> volume status.
>>> Heal should start automatically; you can check the heal status to see
>>> that all the files get replicated to the newly added brick. If it does
>>> not start automatically, you can start it manually by running
>>> gluster volume heal <volname>.
>>>
>>> HTH,
>>> Karthik
>>>
>>> On Wed, Apr 10, 2019 at 3:13 PM Martin Toth <snowmailer at gmail.com> wrote:
>>> Hi all,
>>>
>>> I am running a replica 3 gluster volume with 3 bricks. One of my servers
>>> failed: all disks are showing errors and the RAID is in a fault state.
>>>
>>> Type: Replicate
>>> Volume ID: 41d5c283-3a74-4af8-a55d-924447bfa59a
>>> Status: Started
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: node1.san:/tank/gluster/gv0imagestore/brick1
>>> Brick2: node2.san:/tank/gluster/gv0imagestore/brick1 <- this brick is down
>>> Brick3: node3.san:/tank/gluster/gv0imagestore/brick1
>>>
>>> So one of my bricks (node2) has failed completely. It went down and all
>>> its data is lost (failed RAID on node2). Now I am running only two bricks
>>> on 2 servers out of 3.
>>> This is a really critical problem for us; we could lose all data. I want
>>> to add new disks to node2, create a new RAID array on them and try to
>>> replace the failed brick on this node.
>>>
>>> What is the procedure for replacing Brick2 on node2? Can someone advise?
>>> I can't find anything relevant in the documentation.
>>>
>>> Thanks in advance,
>>> Martin
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users

Karthik Subrahmanya

2019-Apr-11 13:40 UTC

[Gluster-users] Replica 3 - how to replace failed node (peer)

On Thu, Apr 11, 2019 at 6:38 PM Martin Toth <snowmailer at gmail.com> wrote:

> Hi Karthik,
>
> On Thu, Apr 11, 2019 at 12:43 PM Martin Toth <snowmailer at gmail.com> wrote:
>
>> Hi Karthik,
>>
>> moreover, I would like to ask whether there are any recommended
>> settings/parameters for SHD to achieve good or fair I/O while the volume
>> is being healed after I replace the brick (this should trigger the
>> healing process).
>>
> If I understand your concern correctly, you need fair I/O performance for
> clients while healing takes place as part of the replace-brick operation.
> For this you can turn off the "data-self-heal" and "metadata-self-heal"
> options until the heal completes on the new brick.
>
>
> This is exactly what I mean. I am running VM disks on the remaining 2 nodes
> (out of 3, one failed as mentioned) and I need to ensure fair I/O
> performance on these two nodes while the replace-brick operation heals the
> volume.
> I will not run any VMs on the node where the replace-brick operation will
> be running. So if I understand correctly, when I set:
>
> # gluster volume set <volname> cluster.data-self-heal off
> # gluster volume set <volname> cluster.metadata-self-heal off
>
> this will tell Gluster clients (libgfapi and FUSE mount) not to read from
> the node "where replace brick operation" is in place, but from the
> remaining two healthy nodes. Is this correct? Thanks for the clarification.
>

The reads will be served from one of the good bricks, since at the time of
the read the file will either not be present on the replaced brick, or it
will be present but marked for heal if it is not already healed. If SHD has
already healed it, the read could be served from the new brick as well, but
there won't be any problem reading from there in that scenario.
With these two options set, a read coming from a client will not try to heal
the file's data/metadata. Otherwise the client would try to heal it on read
(if SHD has not already healed it), slowing the client down.
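
A minimal sketch of that toggle, for illustration only. The volume name
"myvol", the helper names `run` and `set_client_heal`, and the `DRY_RUN`
switch are assumptions and not part of gluster; only the two
`gluster volume set` options come from this thread. Restoring the options
to "on" assumes that was their value before the replacement. By default
the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: switch client-side data/metadata self-heal off before a brick
# replacement and back on once the heal has completed.
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_client_heal <volname> <on|off>
set_client_heal() {
    vol="$1"
    state="$2"
    case "$state" in
        on|off) ;;
        *) echo "usage: set_client_heal <volname> <on|off>" >&2; return 1 ;;
    esac
    run gluster volume set "$vol" cluster.data-self-heal "$state" &&
    run gluster volume set "$vol" cluster.metadata-self-heal "$state"
}

# Before replacing the brick ("myvol" is a placeholder volume name):
#   set_client_heal myvol off
# After "gluster volume heal myvol info" lists no pending entries:
#   set_client_heal myvol on
```

Running it with DRY_RUN=0 on one of the healthy nodes would execute the
commands for real; SHD keeps healing in the background either way.
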
>
> Turning off client-side healing doesn't compromise data integrity or
> consistency. During a read request from a client, the pending xattr is
> evaluated for the replica copies and the read is served only from a correct
> copy. During writes, I/O continues on both replicas, and SHD takes care of
> healing the files.
>
> After replacing the brick, we strongly recommend that you consider
> upgrading your gluster to one of the maintained versions. They contain many
> stability-related fixes that handle critical issues and corner cases you
> could hit in this kind of scenario.
>
>
> This will be the first priority in our infrastructure after restoring this
> cluster to a fully functional replica 3. I will upgrade to 3.12.x and then
> to version 5 or 6.
>

Sounds good.

If you are planning to use the same name for the new brick and you get an
error like "Brick may be containing or be contained by an existing brick"
even after using the force option, try using a different name. That should
work.
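
A rough sketch of that fallback, under stated assumptions: the volume name
"myvol" and the new directory name "brick2" are hypothetical, and the `run`
helper, the `replace_failed_brick` function and the `DRY_RUN` switch are
illustrative and not part of gluster. The old brick path is the one from
the volume info quoted in this thread. By default the sketch only prints
the commands.

```shell
#!/bin/sh
# Sketch: replace a failed brick with a brick under a different name, then
# check that it is online and look at the pending heal entries.
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replace_failed_brick <volname> <old-brick> <new-brick>
replace_failed_brick() {
    vol="$1"
    old="$2"
    new="$3"
    # This path is the "different name" case, so refuse identical bricks.
    if [ "$old" = "$new" ]; then
        echo "new brick must differ from old brick: $new" >&2
        return 1
    fi
    run gluster volume replace-brick "$vol" "$old" "$new" commit force &&
    run gluster volume status "$vol" &&
    run gluster volume heal "$vol" info
}

# Example call (volume name and "brick2" are placeholders):
#   replace_failed_brick myvol \
#       node2.san:/tank/gluster/gv0imagestore/brick1 \
#       node2.san:/tank/gluster/gv0imagestore/brick2
```

If the heal does not start on its own, `gluster volume heal <volname>` can
be run afterwards, as described earlier in the thread.
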

Regards,
Karthik