thr3ads.net - Gluster users - [Gluster-users] Disperse volume recovery and healing [Mar 2018]

Victor T

2018-Mar-16 03:57 UTC

[Gluster-users] Disperse volume recovery and healing

Xavi, does that mean that if every node were rebooted one at a time without
issuing a heal in between, the volume would have no issues after running
gluster volume heal [volname] once all bricks are back online?

________________________________
From: Xavi Hernandez <jahernan at redhat.com>
Sent: Thursday, March 15, 2018 12:09:05 AM
To: Victor T
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Disperse volume recovery and healing

Hi Victor,

On Wed, Mar 14, 2018 at 12:30 AM, Victor T <hero_of_nothing_1 at hotmail.com> wrote:

I have a question about how disperse volumes handle brick failure. I'm
running version 3.10.10 on all systems. If I have a disperse volume in a 4+2
configuration with 6 servers each serving 1 brick, and maintenance needs to be
performed on all systems, are there any general steps that need to be taken to
ensure data is not lost or service interrupted? For example, can I just reboot
each system sequentially after making sure the service is running on all
servers before rebooting the next system? Or is there a need to force/wait for a
heal after each brick comes back online? If I have two bricks down for multiple
days and then bring them back in, is there a need to issue a heal or something
like a rebalance before rebooting the other servers? There's lots of
documentation about other volume types, but it seems information specific to
dispersed volumes is a bit hard to find. Thanks a bunch.

On a 4+2 configuration you could bring down up to 2 bricks simultaneously for
maintenance. However, if something happens to one of the remaining 4 bricks, the
volume would stop working. So in this case I would recommend not having more
than one server down for maintenance at the same time unless the downtime is
very short.
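
To make the arithmetic concrete, here is a minimal sketch of the availability rule described above. It is not Gluster code; it only assumes that a disperse volume with N data bricks and R redundancy bricks stays usable while no more than R bricks are down. The function name and its parameters are made up for illustration.

```python
def volume_status(data_bricks, redundancy_bricks, bricks_down):
    """Classify a disperse volume by how many more brick failures it can absorb."""
    total = data_bricks + redundancy_bricks
    if not 0 <= bricks_down <= total:
        raise ValueError("bricks_down must be between 0 and the total brick count")
    # Each brick that is down uses up one unit of redundancy.
    spare = redundancy_bricks - bricks_down
    if spare < 0:
        return "unavailable"
    if spare == 0:
        return "available, no margin"
    return f"available, {spare} more failure(s) tolerated"


# The 4+2 layout from this thread, with 0 to 3 bricks down.
for down in range(4):
    print(down, "down:", volume_status(4, 2, down))
```

With 2 bricks down the volume still works but has no margin left, which is why keeping only one server down at a time is the safer choice.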

Once the stopped servers come back up again, you need to wait until all files
are healed before proceeding with the next server. Failing to do so means that
some files could have more than 2 non-healthy versions, which will make the file
inaccessible until enough healthy versions are available again.
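
The effect of skipping the heal can be illustrated with a toy simulation. This is a deliberately simplified model, not how Gluster tracks file state: it assumes a single file on a 4+2 volume, one write to that file while each brick is down, and that a brick which missed a write holds a non-healthy version until healed. All names are hypothetical.

```python
DATA, REDUNDANCY = 4, 2
TOTAL = DATA + REDUNDANCY


def rolling_reboot(heal_between_reboots):
    """Reboot bricks 0..5 one at a time; return whether the file was readable during each reboot."""
    stale = set()      # bricks holding an out-of-date version of the file
    readable = []
    for down in range(TOTAL):
        # Usable versions: the brick is up and its copy is current.
        healthy = [b for b in range(TOTAL) if b != down and b not in stale]
        ok = len(healthy) >= DATA
        readable.append(ok)
        if ok:
            # A write succeeds while this brick is down, so the brick misses it.
            stale.add(down)
        if heal_between_reboots:
            # The brick is back and we wait for heal to finish before moving on.
            stale.clear()
    return readable


print("without healing:", rolling_reboot(False))
print("with healing:   ", rolling_reboot(True))
```

In this model the third reboot without healing leaves only 3 healthy versions (two bricks are stale and one is down), so the file becomes inaccessible, while waiting for the heal keeps it readable throughout.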

Self-heal should be triggered automatically once the bricks come online. However,
there was a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1547662) that could
cause delays in the self-heal process. This bug should be fixed in the next
version. In the meantime, you can force self-heal to progress by issuing "gluster
volume heal <volname>" commands each time it seems to have stopped.

Once the output of "gluster volume heal <volname> info" reports
0 pending files on all bricks, you can proceed with the maintenance of the next
server.
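
The check above could be scripted roughly as follows. This is a sketch that assumes the heal info output contains one "Brick ..." line and one "Number of entries: N" line per brick; the exact format may differ between Gluster versions, and the sample text, host names and brick paths below are invented for illustration.

```python
def pending_heal_entries(heal_info_output):
    """Map each brick to its pending-entry count, or None if the count is not a number."""
    pending = {}
    brick = None
    for line in heal_info_output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            # A non-numeric value (e.g. for an unreachable brick) is treated as unknown.
            pending[brick] = int(value) if value.isdigit() else None
            brick = None
    return pending


def safe_to_proceed(heal_info_output, expected_bricks):
    """True only if every expected brick reported exactly 0 pending entries."""
    pending = pending_heal_entries(heal_info_output)
    return len(pending) == expected_bricks and all(v == 0 for v in pending.values())


# Invented sample shaped like the assumed output format.
SAMPLE = """\
Brick server1:/bricks/brick1
Status: Connected
Number of entries: 0

Brick server2:/bricks/brick1
/data/file-a
/data/file-b
Status: Connected
Number of entries: 2
"""

print(pending_heal_entries(SAMPLE))
print("safe to proceed:", safe_to_proceed(SAMPLE, expected_bricks=2))
```

Requiring the expected number of bricks guards against treating a missing or unreachable brick as healed.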

No need to do any rebalance for down bricks. Rebalance is basically needed when
a volume is expanded with more bricks.

Xavi

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Xavi Hernandez

2018-Mar-16 06:46 UTC

[Gluster-users] Disperse volume recovery and healing

On Fri, Mar 16, 2018 at 4:57 AM, Victor T <hero_of_nothing_1 at hotmail.com> wrote:
> Xavi, does that mean that even if every node was rebooted one at a time
> even without issuing a heal that the volume would have no issues after
> running gluster volume heal [volname] when all bricks are back online?
>
No. After bringing up one brick and before stopping the next one, you need
to be sure that there are no damaged files. You shouldn't reboot a node if
"gluster volume heal <volname> info" shows damaged files.

The command "gluster volume heal <volname>" is only a tool to force heal to
progress (until the bug is fixed).
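
Putting the two commands together, a wait loop between reboots might look like the sketch below. It only uses the two commands quoted in this thread and assumes the heal info output has one "Number of entries: N" line per brick. The helper names, the polling interval and the rule "re-issue the heal when the pending total has not dropped since the last poll" are assumptions for illustration, not a documented procedure.

```python
import subprocess
import time


def gluster(*args):
    """Run the gluster CLI and return its stdout. Assumes gluster is on PATH."""
    result = subprocess.run(["gluster", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def pending_counts(heal_info):
    """Extract one count per 'Number of entries:' line (None if not numeric)."""
    counts = []
    for line in heal_info.splitlines():
        line = line.strip()
        if line.startswith("Number of entries:"):
            value = line.split(":", 1)[1].strip()
            counts.append(int(value) if value.isdigit() else None)
    return counts


def wait_until_healed(volname, brick_count, run=gluster,
                      poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll heal info until every brick reports 0 entries; return True on success."""
    last_total = None
    polls = 0
    while max_polls is None or polls < max_polls:
        counts = pending_counts(run("volume", "heal", volname, "info"))
        if len(counts) == brick_count and all(c == 0 for c in counts):
            return True
        total = sum(c for c in counts if c is not None)
        if last_total is not None and total >= last_total:
            # No progress since the last poll: nudge self-heal.
            run("volume", "heal", volname)
        last_total = total
        sleep(poll_seconds)
        polls += 1
    return False
```

For the volume in this thread one would call something like wait_until_healed("myvol", 6) after each server comes back, where "myvol" is a placeholder volume name, and reboot the next server only once it returns True.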

Xavi

Victor T

2018-Mar-18 02:47 UTC

[Gluster-users] Disperse volume recovery and healing

> No. After bringing up one brick and before stopping the next one, you need to be
> sure that there are no damaged files. You shouldn't reboot a node if
> "gluster volume heal <volname> info" shows damaged files.

What happens in this case then? I'm thinking about a situation where the
servers are kept in an environment that we don't control, i.e. the cloud.
If the VMs are forcibly rebooted without enough time to complete a heal before
the next one goes down, does that mean the data cannot be guaranteed to be safe?
This has happened to me with Azure before, during the Meltdown/Spectre incident.