thr3ads.net - Gluster users - [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick. [Sep 2013]

Anand Avati

2013-Sep-27 07:35 UTC

[Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.

Hello all,
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.

This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.

Reasons to remove replace-brick (or why remove-brick is better):

- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.

- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally)

- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".

- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.

- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the brick being removed amongst the remaining servers.

- Replace-brick code is complex and messy (the real reason :p).

- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
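
As a rough illustration of the add-brick + remove-brick flow described in the list above, here is a minimal Python sketch that only builds the command lines and does not run them. The volume name `vol0`, the brick paths, and the helper `migrate_brick_commands` are hypothetical and not part of gluster. In practice the `status` call would be repeated until the migration reports completion before `commit` is issued.

```python
def migrate_brick_commands(volume, old_brick, new_brick):
    """Build, in order, the gluster CLI calls that move data off old_brick.

    Nothing is executed here; the caller decides how to run each command.
    """
    base = ["gluster", "volume"]
    return [
        # 1. Make the new brick part of the volume first.
        base + ["add-brick", volume, new_brick],
        # 2. Start draining the old brick onto the remaining bricks.
        base + ["remove-brick", volume, old_brick, "start"],
        # 3. Check progress (repeat until the migration has completed).
        base + ["remove-brick", volume, old_brick, "status"],
        # 4. Only then drop the old brick from the volume.
        base + ["remove-brick", volume, old_brick, "commit"],
    ]


if __name__ == "__main__":
    # Hypothetical volume and brick names, for illustration only.
    for cmd in migrate_brick_commands("vol0", "old:/b1", "new:/b1"):
        print(" ".join(cmd))
```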

I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.

NOTE that replace-brick command itself will still exist, and you can
replace one server with another in case a server dies. It is only the data
migration functionality being phased out.

Please do ask any questions / raise concerns at this stage :)

Avati

James

2013-Sep-27 08:56 UTC

[Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.

On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote:
> Hello all,

Hey,

Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.

Here's the problem:
Given a logically optimal initial volume:

volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2

suppose I know that I want to add/remove bricks such that my new volume
(if I had created it new) looks like:

volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2

What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task? Obviously there are
some simpler corner cases, but I'd like to solve the general case.

The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.

Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks and risk the file system if a
replica is missing. The algorithm should work for any replica N. We want
to make sure the new layout makes sense to replicate the data on
different servers. In many cases, this will require creating a circular
"chain" of bricks as illustrated in the bottom of this image:
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
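
To make the problem concrete, here is a minimal Python sketch of only the first step: diffing the two layouts. It is not the optimal algorithm asked about above. It assumes that consecutive bricks in the list form a replica set, and the helper names `replica_sets` and `diff_layouts` are made up for this example.

```python
def replica_sets(bricks, replica):
    """Group an ordered brick list into consecutive replica sets."""
    if len(bricks) % replica != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [tuple(bricks[i:i + replica]) for i in range(0, len(bricks), replica)]


def diff_layouts(old, new, replica):
    """Return the bricks to add, the bricks to remove, and the replica
    sets that are identical (ignoring order) in both layouts."""
    new_sets = {frozenset(s) for s in replica_sets(new, replica)}
    return {
        "add": [b for b in new if b not in old],
        "remove": [b for b in old if b not in new],
        "unchanged_sets": [
            s for s in replica_sets(old, replica) if frozenset(s) in new_sets
        ],
    }


vol_a = "h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2".split()
vol_b = ("h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 "
         "h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2").split()

plan = diff_layouts(vol_a, vol_b, replica=2)
print(plan["add"])             # bricks that exist only in volB
print(plan["remove"])          # bricks that exist only in volA
print(plan["unchanged_sets"])  # replica pairs that need no change
```

For the volumes above, only the h3:/b2 + h4:/b2 pair survives unchanged, which shows why the sequence of transforms is not obvious: every other replica set has to be re-paired, not just the sets that contain h2.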

Many thanks in advance.

James

Some comments below, although I'm a bit tired so I hope I said it all
right.
> DHT's remove-brick + rebalance has been enhanced in the last couple of
> releases to be quite sophisticated. It can handle graceful decommissioning
> of bricks, including open file descriptors and hard links.

Sweet
> 
> This in a way is a feature overlap with replace-brick's data migration
> functionality. Replace-brick's data migration is currently also used for
> planned decommissioning of a brick.
> 
> Reasons to remove replace-brick (or why remove-brick is better):
> 
> - There are two methods of moving data. It is confusing for the users and
> hard for developers to maintain.
> 
> - If server being replaced is a member of a replica set, neither
> remove-brick nor replace-brick data migration is necessary, because
> self-healing itself will recreate the data (replace-brick actually uses
> self-heal internally)
> 
> - In a non-replicated config if a server is getting replaced by a new one,
> add-brick <new> + remove-brick <old> "start" achieves the same goal as
> replace-brick <old> <new> "start".
> 
> - In a non-replicated config, <replace-brick> is NOT glitch free
> (applications witness ENOTCONN if they are accessing data) whereas
> add-brick <new> + remove-brick <old> is completely transparent.
> 
> - Replace brick strictly requires a server with enough free space to hold
> the data of the old brick, whereas remove-brick will evenly spread out the
> data of the brick being removed amongst the remaining servers.

Can you talk more about the replica = N case (where N is 2 or 3)?
With remove-brick and add-brick you will need to add/remove N (replica count)
bricks at a time, right? With replace-brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace-brick?
> 
> - Replace-brick code is complex and messy (the real reason :p).
> 
> - No clear reason why replace-brick's data migration is better in any way
> to remove-brick's data migration.
> 
> I plan to send out patches to remove all traces of replace-brick data
> migration code by 3.5 branch time.
> 
> NOTE that replace-brick command itself will still exist, and you can
> replace one server with another in case a server dies. It is only the data
> migration functionality being phased out.
> 
> Please do ask any questions / raise concerns at this stage :)

I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?

Thanks!
James
> 
> Avati

Amar Tumballi

2013-Sep-27 17:15 UTC

[Gluster-users] [Gluster-devel] Phasing out replace-brick for data migration in favor of remove-brick.

> Hello all,
> DHT's remove-brick + rebalance has been enhanced in the last couple of
> releases to be quite sophisticated. It can handle graceful decommissioning
> of bricks, including open file descriptors and hard links.
>
Last set of patches for this should be reviewed and accepted before we make
that claim :-) [ http://review.gluster.org/5891 ]

> This in a way is a feature overlap with replace-brick's data migration
> functionality. Replace-brick's data migration is currently also used for
> planned decommissioning of a brick.
>
> Reasons to remove replace-brick (or why remove-brick is better):
>
> - There are two methods of moving data. It is confusing for the users and
> hard for developers to maintain.
>
> - If server being replaced is a member of a replica set, neither
> remove-brick nor replace-brick data migration is necessary, because
> self-healing itself will recreate the data (replace-brick actually uses
> self-heal internally)
>
> - In a non-replicated config if a server is getting replaced by a new one,
> add-brick <new> + remove-brick <old> "start" achieves the same goal as
> replace-brick <old> <new> "start".
>
Should we also phase out the CLI form of 'remove-brick' without any option?
Even if users run it by mistake, they would lose data. We should
enforce the 'start' and then 'commit' usage of remove-brick. Also, if the old
method is required by anyone, they still have the 'force' option.
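
A minimal sketch of the kind of check suggested here, written as a hypothetical Python wrapper rather than as actual gluster CLI behaviour: it refuses to build a remove-brick command unless one of the existing options is given explicitly. The function name and the volume and brick names are assumptions for illustration.

```python
# Options accepted by 'gluster volume remove-brick'.
REMOVE_BRICK_OPTIONS = ("start", "stop", "status", "commit", "force")


def remove_brick_command(volume, brick, option=None):
    """Build a remove-brick command line, insisting on an explicit option.

    Hypothetical wrapper: a bare remove-brick (no option) is rejected so
    that data is never dropped by accident; 'force' remains available
    for anyone who really wants the old behaviour.
    """
    if option is None:
        raise ValueError(
            "remove-brick needs an explicit option: use 'start' and then "
            "'commit', or 'force' to skip data migration"
        )
    if option not in REMOVE_BRICK_OPTIONS:
        raise ValueError("unknown remove-brick option: %r" % (option,))
    return ["gluster", "volume", "remove-brick", volume, brick, option]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(remove_brick_command("vol0", "old:/b1", "start")))
```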


> - In a non-replicated config, <replace-brick> is NOT glitch free
> (applications witness ENOTCONN if they are accessing data) whereas
> add-brick <new> + remove-brick <old> is completely transparent.
>
+10 (that's the number of bugs open on these things :-)

> - Replace brick strictly requires a server with enough free space to hold
> the data of the old brick, whereas remove-brick will evenly spread out the
> data of the brick being removed amongst the remaining servers.
>
> - Replace-brick code is complex and messy (the real reason :p).
>
Wanted to see this reason as the 1st point, but it's OK as long as we
mention it. I too agree that it's _hard_ to maintain that piece of code.

> - No clear reason why replace-brick's data migration is better in any way
> to remove-brick's data migration.
>
One reason I heard when I sent the mail on gluster-devel earlier (
http://lists.nongnu.org/archive/html/gluster-devel/2012-10/msg00050.html )
was that the remove-brick way was a bit slower than replace-brick.
The technical reason is that remove-brick does DHT's readdir, whereas
replace-brick does the brick-level readdir.

> I plan to send out patches to remove all traces of replace-brick data
> migration code by 3.5 branch time.
>
Thanks for the initiative, let me know if you need help.

> NOTE that replace-brick command itself will still exist, and you can
> replace one server with another in case a server dies. It is only the data
> migration functionality being phased out.
>
Yes, we need to be careful about this. We would need 'replace-brick' to
phase out a dead brick. The other day, there was some discussion on having
'gluster peer replace <old-peer> <new-peer>', which would re-write all the
vol files properly. But that's mostly for the 3.6 time frame IMO.


> Please do ask any questions / raise concerns at this stage :)
>
>
What is the window before you start sending out patches? I see
http://review.gluster.org/6010 which I guess is not totally complete
without phasing out pump xlator :-)

I personally am all in for this change, as it helps me to finish a few more
enhancements I am working on, like the 'discover()' changes etc.

Regards,
Amar