thr3ads.net - llvm dev - [llvm-dev] RFC: Dealing with out of tree changes and the LLVM git monorepo [Nov 2018]

James Y Knight via llvm-dev

2018-Nov-02 16:58 UTC

[llvm-dev] RFC: Dealing with out of tree changes and the LLVM git monorepo

Thanks for writing this up. I think it's a really important point which
deserves discussion.

Ultimately, I think it is a question as to whether to prioritize the easy
switchover for existing out of tree forks, or to prioritize having the best
conversion we can make. I feel very strongly that the latter should be the
priority for the official repository conversion, and that, therefore, we
should not use the zipper method for the official repository going forward.

However, it's also worth putting much thought into making switchover as
easy as possible within the confines of what's possible given that
prioritization.

On Wed, Oct 31, 2018 at 12:22 PM Justin Bogner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> An arguably cleaner solution would be to try to recreate all of my trees'
> history artificially as if they were based on the monorepo prototype
> history all along, but this has two problems. First, it's a very
> significant tooling effort to do this - I'd need to match up several
> years of merge points to their corresponding spots in the monorepo
> prototype and somehow redo all of the merges in the same ways. Tools
> like "rebase --preserve-merges" don't really help here, since they abort
> on merge conflicts and ask a human to resolve them again.

I realized I had most of the functionality needed for this already written
for such a conversion tool, so I've written a tool which is able to
(mostly!) convert a single-project repository (with all its commits and
merges), into a monorepo repository (with the same commits and merges). The
transform is conceptually trivial -- take the subproject's tree from the
old commit, and take the rest of the content from the monorepo parent.
That's perfect -- no need to deal with any conflict resolution, UNLESS
there are potential merge conflicts in the parts of the tree OUTSIDE the
original repository's subproject.
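
As a rough illustration of the transform described above (this is not the
actual tool, which had not been posted at this point), a tree can be modeled
as a mapping from path to file content. The function name `to_monorepo_tree`,
the dict representation, and the sample paths are all assumptions made for
this sketch.

```python
def to_monorepo_tree(subproject_tree, monorepo_parent_tree, subproject):
    """Build the tree for a converted commit.

    Trees are modeled as {path: content} dicts, which is an assumption of
    this sketch; real git trees are nested objects.
    """
    prefix = subproject + "/"
    # Keep everything outside the subproject exactly as in the monorepo parent.
    result = {
        path: content
        for path, content in monorepo_parent_tree.items()
        if not path.startswith(prefix)
    }
    # Take the subproject's content wholesale from the old split-repo commit,
    # moving it under the subproject directory.
    for path, content in subproject_tree.items():
        result[prefix + path] = content
    return result


# Hypothetical data: a commit from a clang.git fork and its monorepo parent.
old_clang_commit = {"lib/Sema.cpp": "fork change", "README": "clang"}
monorepo_parent = {
    "llvm/lib/IR.cpp": "upstream",
    "clang/lib/Sema.cpp": "upstream",
    "clang/README": "clang",
    "clang/docs/New.rst": "not present in the fork commit",
}
converted = to_monorepo_tree(old_clang_commit, monorepo_parent, "clang")
```

Because nothing outside `clang/` is ever taken from the fork, no conflict
resolution is needed unless the merge parents disagree about those other
directories.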

As the original repository will -- by definition -- not touch the other
directories, such a conflict can only happen if you have merges between
upstream-svn release branches in your history. E.g., if, in your fork of
clang.git, you started working from the release_50 branch, then
(potentially after a bunch of work), merged the release_60 branch. In your
clang fork, you of course had to resolve any conflicts in clang, but would
NOT have resolved conflicts between release_50 and release_60 in "llvm" or
other subprojects. The tool can't necessarily know what to do here either.

Now, in that case, it's pretty likely that you'd want to just take the
release_60 tree as is, throwing out the changes that happened only on the
release_50 branch. So, if this seems useful, I can imagine adding some
heuristics or manual override to support that particular case.

I'll post the tool soon. It could also be extended to support conversions
from the previous monorepo repositories to make that easier for folks too.

> Even if I were to come up with tooling that managed this, I'm still left with a
> completely new set of hashes for commits and no easy way to map them to
> existing references in emails, bug trackers, and release notes

*Creating* such a commit mapping is certainly easy.

[....]

One more option -- which I've not yet tried, but seems like it could be
really promising -- would be to have _your_ repository's history have a
different shape from everyone else's, but still keep the same commit hashes
at head, going forward. Of course "That's impossible!" -- editing the
history will necessarily change the hash! But, actually, you can pull this
off using "replace" refs (see "git replace").

Start with the git merge you already created (merging all your
split-repositories into one branch on top of a monorepo-prototype commit).
Then, "git replace" the monorepo-prototype commit that you merged in with a
commit that has the same content, but from your "zippered" repository
history. That won't change the hash (thus, future merging will work
properly), but it effectively changes the history to be the way you'd like
to see it.

Thus you'll see the zipped history up until that point, avoiding seeing
multiple copies of svn commits in your history, and you can use the new
monorepo commits going forward.

(One note -- users would need to fetch the replace ref after cloning a new
repo (e.g., with `git fetch origin refs/replace/*:refs/replace/*`), since
clone won't fetch it automatically. If they forgot to, they'd simply see
the "normal" history, rather than the zipper history.)
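
To make the replace-ref idea concrete, here is a toy model (not git's
implementation) of how a history walk behaves when a replace ref is present.
The commit names, the `parents` table, and the `history` function are all
hypothetical; the only property borrowed from git is that a replaced object
keeps its name while its content, including its parent list, is read from
the replacement.

```python
def history(head, parents, replace=None):
    """Return the commits reachable from head, depth-first."""
    replace = replace or {}
    seen, order, stack = set(), [], [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        # A replace ref makes the walk read a different object in place of
        # `commit`, while every reference to `commit` keeps its name.
        source = replace.get(commit, commit)
        stack.extend(reversed(parents[source]))
    return order


# Hypothetical commit graph: each name maps to its list of parents.
parents = {
    "mono1": [],
    "mono2": ["mono1"],          # official linear monorepo history
    "split1": [],
    "split2": ["split1"],        # old split-repository history
    "zip2": ["split2"],          # same tree as mono2, zippered ancestry
    "fork": ["split1"],          # out-of-tree work based on the split repo
    "merge": ["fork", "mono2"],  # the merge onto the monorepo prototype
}

plain = history("merge", parents)
zippered = history("merge", parents, replace={"mono2": "zip2"})
```

In this model "merge" still names "mono2" as its parent in both walks, so
later merges from the official repository are unaffected; only what lies
behind "mono2" differs.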

Justin Bogner via llvm-dev

2018-Nov-02 18:11 UTC

James Y Knight <jyknight at google.com> writes:
> Thanks for writing this up. I think it's a really important point which
> deserves discussion.
>
> Ultimately, I think it is a question as to whether to prioritize the easy
> switchover for existing out of tree forks, or to prioritize having the best
> conversion we can make. I feel very strongly that the latter should be the
> priority for the official repository conversion, and that, therefore, we
> should not use the zipper method for the official repository going forward.

How do you define "best conversion" here? I may be missing something,
but I really don't see any actual advantage to re-writing the git
history from scratch rather than leveraging the existing git mirrors to
build a monorepo.

The re-generated history approach gives us an artificial alternate
history where we developed in a git monorepo from the beginning of time.
It throws away a bunch of information for the sake of making a
"pristine" conversion with fewer branches, even though those branches
have almost zero cost.

The zipper approach gives us the best of both worlds - it provides a
monorepo view for all time for anyone who wants it, but also preserves
the history that people have been using and relying on for a number of
years.

I'd like to hear what you think are the actual disadvantages of the
zipper approach. I've spoken to quite a few people about it in the last
few days and I haven't really found any yet.

> However, it's also worth putting much thought into making switchover as
> easy as possible within the confines of what's possible given that
> prioritization.

While I obviously agree with your intent here, I'm not convinced that
arbitrarily prioritizing a vacuous definition of "the best conversion"
over engineering tradeoffs makes sense. The cost of the switchover is a
real cost that will affect a large portion of the llvm community, and if
the cost is too high it's very likely that it will delay the process of
switching to git even further as various groups aren't able to get
themselves switched over by whatever deadlines we set.

> On Wed, Oct 31, 2018 at 12:22 PM Justin Bogner via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> An arguably cleaner solution would be to try to recreate all of my trees'
>> history artificially as if they were based on the monorepo prototype
>> history all along, but this has two problems. First, it's a very
>> significant tooling effort to do this - I'd need to match up several
>> years of merge points to their corresponding spots in the monorepo
>> prototype and somehow redo all of the merges in the same ways. Tools
>> like "rebase --preserve-merges" don't really help here, since they abort
>> on merge conflicts and ask a human to resolve them again.
>
> I realized I had most of the functionality needed for this already written
> for such a conversion tool, so I've written a tool which is able to
> (mostly!) convert a single-project repository (with all its commits and
> merges), into a monorepo repository (with the same commits and merges). The
> transform is conceptually trivial -- take the subproject's tree from the
> old commit, and take the rest of the content from the monorepo parent.
> That's perfect -- no need to deal with any conflict resolution, UNLESS
> there are potential merge conflicts in the parts of the tree OUTSIDE the
> original repository's subproject.
>
> As the original repository will -- by definition -- not touch the other
> directories, such a conflict can only happen if you have merges between
> upstream-svn release branches in your history. E.g., if, in your fork of
> clang.git, you started working from the release_50 branch, then
> (potentially after a bunch of work), merged the release_60 branch. In your
> clang fork, you of course had to resolve any conflicts in clang, but would
> NOT have resolved conflicts between release_50 and release_60 in "llvm" or
> other subprojects. The tool can't necessarily know what to do here either.
>
> Now, in that case, it's pretty likely that you'd want to just take the
> release_60 tree as is, throwing out the changes that happened only on the
> release_50 branch. So, if this seems useful, I can imagine adding some
> heuristics or manual override to support that particular case.
>
> I'll post the tool soon. It could also be extended to support conversions
> from the previous monorepo repositories to make that easier for folks too.

I'm looking forward to trying this out, but I'm still skeptical about
whether it will be enough.

Consider a world where I convert all of my branches as-if they were
based on the monorepo. Now, something comes up and I need to hot fix
last year's branch. I probably can't actually submit this fix from the
monorepo, since it would be too disruptive to also hot fix the
configuration changes to submit from a new layout of repositories. Now I
need to maintain two copies of my code and manage merging between them.

>> Even if I were to come up with tooling that managed this, I'm still
>> left with a completely new set of hashes for commits and no easy way
>> to map them to existing references in emails, bug trackers, and
>> release notes
>
> *Creating* such a commit mapping is certainly easy.

Finding every place to update is much harder, and actually updating them
is probably impractical.

It should be possible to write a tool to find every commit reference in
my bug tracker and add a comment saying where it moved, but everyone
will have to be careful to check for this update if they look at some
particular comment. Emails and release notes are immutable.

I suppose I could provide a tool that mapped an old hash to a new one
and teach everyone to use it every time they looked at a hash related to
my repos. When could we stop using this tool every time we look at a
hash though? 5 years? 10 years?
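
A minimal sketch of what such an annotation pass could look like, assuming
a mapping from old to new hashes is already available. The function name
`annotate_hashes` and the sample hashes are made up for illustration, and
only full 40-character hashes are handled.

```python
import re


def annotate_hashes(text, old_to_new):
    """Append the new hash after each known old hash found in text."""

    def repl(match):
        old = match.group(0)
        new = old_to_new.get(old)
        # Unknown hashes are left untouched.
        return f"{old} (now {new})" if new else old

    # Match full 40-character hex object names only; abbreviated hashes
    # would need prefix lookup, which this sketch does not attempt.
    return re.sub(r"\b[0-9a-f]{40}\b", repl, text)


# Made-up hashes standing in for a split-repo commit and its monorepo twin.
old_hash = "a" * 40
new_hash = "b" * 40
unknown_hash = "c" * 40

comment = f"Fixed in {old_hash}. See also {unknown_hash}."
annotated = annotate_hashes(comment, {old_hash: new_hash})
```

This covers only text that can still be edited; it does nothing for the
immutable emails and release notes mentioned above.
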
> [....]
>
>
> One more option -- which I've not yet tried, but seems like it could be
> really promising -- would be to have _your_ repository's history have a
> different shape from everyone else's, but still keep the same commit hashes
> at head, going forward. Of course "That's impossible!" -- editing the
> history will necessarily change the hash! But, actually, you can pull this
> off using "replace" refs (see "git replace").
>
> Start with the git merge you already created (merging all your
> split-repositories into one branch on top of a monorepo-prototype commit).
> Then, "git replace" the monorepo-prototype commit that you merged in with a
> commit that has the same content, but from your "zippered" repository
> history. That won't change the hash (thus, future merging will work
> properly), but it effectively changes the history to be the way you'd like
> to see it.
>
> Thus you'll see the zipped history up until that point, avoiding seeing
> multiple copies of svn commits in your history, and you can use the new
> monorepo commits going forward.
>
> (One note -- users would need to fetch the replace ref after cloning a new
> repo (e.g., with `git fetch origin refs/replace/*:refs/replace/*`), since
> clone won't fetch it automatically. If they forgot to, they'd simply see
> the "normal" history, rather than the zipper history.)

So, we publish two competing versions of the git history and let people
choose? This sounds like a splitting the baby type solution to me ;)

James Y Knight via llvm-dev

2018-Nov-02 21:56 UTC

On Fri, Nov 2, 2018 at 2:11 PM Justin Bogner <mail at justinbogner.com> wrote:
> James Y Knight <jyknight at google.com> writes:
> > Thanks for writing this up. I think it's a really important point which
> > deserves discussion.
> >
> > Ultimately, I think it is a question as to whether to prioritize the easy
> > switchover for existing out of tree forks, or to prioritize having the best
> > conversion we can make. I feel very strongly that the latter should be the
> > priority for the official repository conversion, and that, therefore, we
> > should not use the zipper method for the official repository going forward.
>
> How do you define "best conversion" here? I may be missing something,
> but I really don't see any actual advantage to re-writing the git
> history from scratch rather than leveraging the existing git mirrors to
> build a monorepo.
>
> The re-generated history approach gives us an artificial alternate
> history where we developed in a git monorepo from the beginning of time.
>
I note that "we", where "we" = llvm upstream developers, *have* been
developing in a monorepo -- an SVN monorepo, with a linear history. The
llvm-git-prototype repository better matches this actual development
history.

> It throws away a bunch of information for the sake of making a
> "pristine" conversion with fewer branches, even though those branches
> have almost zero cost.
>
You mean "merge commits" here, not "branches", I believe, since your
repository has a single branch, "master".

> The zipper approach gives us the best of both worlds - it provides a
> monorepo view for all time for anyone who wants it, but also preserves
> the history that people have been using and relying on for a number of
> years.
>
> I'd like to hear what you think are the actual disadvantages of the
> zipper approach. I've spoken to quite a few people about it in the last
> few days and I haven't really found any yet.
>
The downside, generally, is that it makes the history _very_ complex. Which
is not necessarily bad in and of itself, but it's not really representative
of the development history of llvm, and it turns out that it causes
problems.

Two concrete disadvantages have been mentioned on this thread, already:
1. gitk cannot be used -- it just falls over when given the history with
300000 merges.
2. git bisect becomes somewhat trickier.

I'd add a couple more to that:

3. "git log -u llvm/" no longer works (for any file/path), because the
commits which *actually* changed the files don't occur at that path, and
the default is to omit diffs arising in a merge commit. (The actual content
change happened at a different path -- the root of the tree, not under
"llvm/", and is just moved under llvm/ in the merge commit.)

You can work around this, via "git log -u -m --first-parent llvm/" to get
the diffs from the merge commit itself. But this is a large annoyance --
looking at path histories is a very common task. Making matters worse right
now is that the zipper merge-commit doesn't have the full commit message,
only the first line. That, at least, can be fixed.

4. Other commands like "git log -u -S CFGStackify" become trickier -- it
returns the individual project commit, not the merge commit, and gives the
"wrong" pathnames (without the subproject prefix). So every time you look
at one of these, you need to map it back yourself.

Those are just the first things I tried -- I think there will be *more*
variants of these sorts of issues which will show up with further attempts
to use a repository built in this style. Certainly none of these are FATAL
problems, but they will be a constant irritation.

> Consider a world where I convert all of my branches as-if they were
> based on the monorepo. Now, something comes up and I need to hot fix
> last year's branch. I probably can't actually submit this fix from the
> monorepo, since it would be too disruptive to also hot fix the
> configuration changes to submit from a new layout of repositories. Now I
> need to maintain two copies of my code and manage merging between them.

I can't speak to your exact problem, not knowing anything about your
infrastructure or workflows. But, if you were to want to keep using your
old separated repositories for your old branches, and to switch only master
and future branches over to the new monorepo, "git format-patch" and "git
apply" do make copying commits or stacks of commits between completely
different repositories (even between split and mono repositories)
relatively straightforward.
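
As one possible shape for that workflow, the sketch below only builds the
command lines and does not run git. It assumes a single commit is exported
from a split clang repository and applied inside a monorepo checkout; the
helper names, the commit name, and the patch file name are hypothetical.
`--directory` is the "git apply" option that prepends a directory to every
path in the patch.

```python
def export_commit_cmd(commit):
    """argv that prints one commit of the split repository as a patch."""
    return ["git", "format-patch", "-1", "--stdout", commit]


def apply_patch_cmd(patch_file, subproject):
    """argv that applies a split-repo patch inside a monorepo checkout.

    --directory prepends the subproject directory to every path in the
    patch, so "lib/Foo.cpp" is applied as "<subproject>/lib/Foo.cpp".
    """
    return ["git", "apply", f"--directory={subproject}", patch_file]


# Hypothetical names for illustration only.
export_cmd = export_commit_cmd("my-hotfix")
apply_cmd = apply_patch_cmd("hotfix.patch", "clang")
```

Going the other way, from the monorepo back to a split repository, would
need the prefix stripped instead, which this sketch does not cover.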

[...]

> So, we publish two competing versions of the git history and let people
> choose? This sounds like a splitting the baby type solution to me ;)
>
The point here is not to offer "competing" versions -- the
llvm-git-prototype monorepo (with the linear history) would be the official
repository, recommended by default for everyone.

The technique outlined above simply offers a solution for those who may
desire to have their historical commits appear in the zipper-merge fashion,
to do so up to the point in history where they last pulled from the split
git repositories into their private forks, and to then switch to the
official monorepo afterward.

I'd still recommend avoiding that, generally, because of the issues caused
by the more complex structure of the zipper-repository. But if that's the
path that is best for your repository and infrastructure, I believe it's
feasible to do so without needing to impact others.