Now that we seem to be converging on an acceptable Git model, there is only one remaining doubt: how the trigger to update a sequential ID will work. I've been in contact with GitHub folks, and this is in line with their suggestions...

Given the nature of our project's repository structure, triggers in each repository can't just update their own sequential ID (like Gerrit) because we want a sequence that is ordered across the whole project, not just each component. But it's clear to me that we have to do something similar to Gerrit, as this has been proven to work on a larger infrastructure.

Adding an incremental "Change-ID" to the commit message should suffice, in the same way as we have SVN revisions now, if we can guarantee that:

1. The ID will be unique across *all* projects.
2. Earlier pushes will get lower IDs than later ones.

Other things are not important:

3. We don't need the ID space to be complete (ie. we can jump from 123 to 125 if some error happens).
4. We don't need an ID for every "commit", but for every push. A multi-commit push is a single feature, and doing so will help buildbots build the whole set as one change. Reverts should also be done in one go.

What's left for the near future:

5. We don't yet handle multi-repository patch-sets. A way to implement this is via manual Change-ID manipulation (explained below). Not hard, but not a priority.

Design decisions

This could be a pre/post-commit trigger on each repository that receives an ID from somewhere (TBD) and updates the commit message. When the umbrella project synchronises, it'll already have the sequential number in. In this case, the umbrella project is not necessary for anything other than bisect, buildbots and releases.

I personally believe that having the trigger in the umbrella project will be harder to implement and more error prone.

The server has to have some kind of locking mechanism.
Web services normally spawn dozens of "listeners", meaning multiple pushes won't fail to get a response, since the lock will be further down, after the web server. Therefore, the lock for the unique increment ID has to be elsewhere. The easiest thing I can think of is a SQL database with an auto-increment ID. Example:

Initially:

    sql> create table LLVM_ID (
           id int not null primary key auto_increment,
           repository varchar not null,
           hash varchar not null );
    sql> alter table LLVM_ID auto_increment = 300000;

On every request:

    sql> insert into LLVM_ID (repository, hash) values ("$repo_name", "$hash");
    sql> select last_insert_id(); -> return

and then print the "last insert id" back to the user in the body of the page, so the hook can update the Change-ID in the commit message. The repo/hash info is more for logging, debugging and conflict resolution purposes.

We also must limit the web server to only accept connections from GitHub's servers, to avoid abuse. Other repos on GitHub could still abuse it, and we can go further if it becomes a problem, but given point (3) above, we may fix that only if it does happen.

This solution doesn't scale to multiple servers, nor does it help with BCP planning. Given the size of our needs, that's not relevant.

Problems

If the server goes down, given point (3), we may not be able to reproduce locally the same sequence as the server would. Meaning SVN-based bisects and releases would not be possible during down times. But Git bisect and everything else would.

Furthermore, even if a local script can't reproduce exactly what the server would do, it can still make the history linear for bisect purposes, fixing the local problem. I can't see a situation in which we need the sequence for any other purpose.

Upstream and downstream releases can easily wait a day or two in the unlucky situation that the server goes down at the exact time the release will be branched.
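To illustrate the locking requirement, the critical section of such an ID service could be sketched without SQL at all, using just a lock file and a counter. This is only a sketch under stated assumptions: the paths, the `flock(1)` utility (from util-linux) and the 300000 starting value are illustrative stand-ins for the SQL auto_increment above, not an agreed design.

```shell
#!/bin/sh
# Minimal sketch of the unique-increment-ID service core. flock(1)
# serialises concurrent requests, standing in for the database's
# auto_increment lock; paths and the starting value are illustrative.
COUNTER=/tmp/llvm_push_id
LOCK=/tmp/llvm_push_id.lock

next_push_id() {
  (
    flock 9                                     # block until we hold the lock
    id=$(cat "$COUNTER" 2>/dev/null || echo 299999)
    id=$((id + 1))                              # allocate the next ID
    echo "$id" > "$COUNTER"
    echo "$id"                                  # what the hook would receive
  ) 9>"$LOCK"
}

rm -f "$COUNTER"
next_push_id   # first request
next_push_id   # second request
```

Because the lock is held across read-increment-write, two simultaneous pushes can never be handed the same ID, which is the whole point of moving the lock out of the web-server layer.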
Migrations and backups also work well, and if we use some cloud server, we can easily take snapshots every week or so, migrate images across the world, etc. We don't need duplication, read-only scaling, multi-master, etc., since only the web service will be writing to and reading from it.

All in all, a "robust enough" solution for our needs.

Bundle commits

Just FYI, here's a proposal that appeared in the "commit message format" round of emails a few months ago, which can work well for bundling commits together, but will need more complicated SQL handling.

The current proposal is to have one ID per push. This is easy using auto_increment. But if we want to have one ID for multiple pushes, on different repositories, we'll need to have the same ID on two or more "repo/hash" pairs.

At the commit level, the developer adds a temporary hash, possibly generated by a local script in 'utils'. Example:

    Commit-ID: 68bd83f69b0609942a0c7dc409fd3428

This ID will have to be the same on both (say) the LLVM and Clang commits. The server script will then take that hash, generate an ID, and if it receives two or more pushes with such hashes, it'll return the *same* ID, say 123456, in which case the Git hooks on all projects will update the commit message by replacing the original Commit-ID with:

    Commit-ID: 123456

To avoid hash clashes in the future, the server script can refuse existing hashes that are more than a few hours old and return an error, in which case the developer generates a new hash, updates all commit messages and re-pushes.

If there is no Commit-ID, or if it's empty, we just insert a new row, get the auto-increment ID and return. Meaning, empty Commit-IDs won't "match" any other.

To solve this on the server side, a few ways are possible:

A. We stop using primary_key auto_increment, handle the increment in the script and use SQL transactions.

This would be feasible, but more complex and error prone.
I suggest we go down that route only if keeping the repo/hash information is really important.

B. We ditch keeping a record of repo/hash and just re-use the ID, but record the original string, so we can match it later.

This keeps it simple and will work for our purposes, but we'll lose the ability to debug problems if they happen in the future.

C. We improve the SQL design to have two tables:

    LLVM_ID:
     * ID: int PK auto
     * Key: varchar null

    LLVM_PUSH:
     * LLVM_ID: int FK (LLVM_ID:ID)
     * Repo: varchar not null
     * Push: varchar not null

Every new push updates both tables and returns the ID. Pushes with the same Key re-use the ID, update only LLVM_PUSH, and return the same ID.

This is slightly more complicated, and we'll need to code scripts to gather information (for logging and debugging), but it gives us both benefits (debug + auto_increment) in one package. As a start, I'd recommend we take this route even before the client script supports it; it could be simple enough that we add support for it right from the beginning.

I vote for option C.

Deployment

I recommend we code this, set up a server, and let it run for a while on our current mirrors *before* we do the move. A simple plan is to:

* Develop the server and hooks, and set them running without updating the commit message.
* Follow the logs and make sure everything is sane.
* Change the hook to start updating the commit message.
* Follow the commit messages, and move some buildbots to track GitHub (SVN still master).
* When all bots are live tracking GitHub and all developers have moved, we flip.

Sounds good?

cheers,
--renato
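The temporary Commit-ID described in the "Bundle commits" section above could come from a small local helper along these lines. This is hypothetical: no such script exists in 'utils' yet, and the function name and byte count are made up for illustration to match the 32-hex-character example in the proposal.

```shell
#!/bin/sh
# Hypothetical local helper for the proposed 'utils' script: emits a
# 32-hex-character temporary Commit-ID (like the example
# 68bd83f69b0609942a0c7dc409fd3428 above), to be pasted into each
# commit message of a multi-repository patch-set before pushing.
gen_commit_id() {
  # 16 random bytes, hex-encoded with spaces and newlines stripped
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

id=$(gen_commit_id)
echo "Commit-ID: $id"
```

The developer would run this once per patch-set and use the same value in the LLVM and Clang commit messages, which is what lets the server match them up.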
On 6/30/2016 7:43 AM, Renato Golin via llvm-dev wrote:
> Given the nature of our project's repository structure, triggers in
> each repository can't just update their own sequential ID (like
> Gerrit) because we want a sequence in order for the whole project, not
> just each component. But it's clear to me that we have to do something
> similar to Gerrit, as this has been proven to work on a larger
> infrastructure.

I'm assuming that pushes to submodules will result in a (nearly) immediate commit/push to the umbrella repo to update it with the new submodule head. Otherwise, checking out the umbrella repo won't get you the latest submodule updates. Since updates to the umbrella project are needed to synchronize it with updates to sub-modules, it seems to me that if you want an ID that applies to all projects, it would have to be coordinated relative to the umbrella project.

> Design decisions
>
> This could be a pre/post-commit trigger on each repository that
> receives an ID from somewhere (TBD) and updates the commit message.
> When the umbrella project synchronises, it'll already have the
> sequential number in. In this case, the umbrella project is not
> necessary for anything other than bisect, buildbots and releases.

I recommend using git tag rather than updating the commit message itself. Tags are more versatile.

> I personally believe that having the trigger in the umbrella project
> will be harder to implement and more error prone.

Relative to a SQL database and a server, I think managing the ID from the umbrella repository would be much simpler and more reliable. Managing IDs from a repo using git metadata is pretty simple. Here's an example script that creates a repo and allocates a push tag in conjunction with a sequence of commits (here I'm simulating pushes of individual commits rather than using git hooks, for simplicity). I'm not a git expert, so there may be better ways of doing this, but I don't know of any problems with this approach.
    #!/bin/sh
    rm -rf repo

    # Create a repo
    mkdir repo
    cd repo
    git init

    # Create a well-known object.
    PUSH_OBJ=$(echo "push ID" | git hash-object -w --stdin)
    echo "PUSH_OBJ: $PUSH_OBJ"

    # Initialize the push ID to 0.
    git notes add -m 0 $PUSH_OBJ

    # Simulate some commits and pushes.
    for i in 1 2 3; do
        echo $i > file$i
        git add file$i
        git commit -m "Added file$i" file$i
        PUSH_TAG=$(git notes show $PUSH_OBJ)
        PUSH_TAG=$((PUSH_TAG+1))
        git notes add -f -m $PUSH_TAG $PUSH_OBJ
        git tag -m "push-$PUSH_TAG" push-$PUSH_TAG
    done

    # List commits with push tags.
    git log --decorate=full

Running the above shows a git log with the tags:

    commit a4ca4a0b54d5fb61a2dacbab5732d00cf8216029 (HEAD, tag: refs/tags/push-3, refs/heads/master)
    ...
        Added file3

    commit e98e2669569d5cfb15bf4cd1f268507873bcd63f (tag: refs/tags/push-2)
    ...
        Added file2

    commit 5c7f29107838b4af91fe6fa5c2fc5e3769b87bef (tag: refs/tags/push-1)
    ...
        Added file1

The above script is not transaction safe because it runs commands individually. In a real deployment, git hooks would be used and would rely on push locks to synchronize updates. Those hooks could also distribute ID updates to the submodules to keep them synchronized.

Tom.
I don't think we should do any of that. It's too complicated -- and I don't see the reason to even do it.

There's a need for the "llvm-project" repository -- that's been discussed plenty -- but where does the need for a separate "id" that must be pushed into all of the sub-projects come from? This is the first I've heard of that as a thing that needs to be done.

There was a previous discussion about putting a sequential ID in the "llvm-project" repo commit messages (although even that I'd say is unnecessary), but not anywhere else.
Reid Kleckner via llvm-dev
2016-Jun-30 16:33 UTC
[llvm-dev] [lldb-dev] Sequential ID Git hook
Agreed, the llvm-project repository can completely take on the role of the SQL database in Renato's proposal.

Chromium created a "git-number" extension that assigns sequential IDs to commits in the obvious way, and that provided some continuity with the "git-svn-id:" footers in commit messages. I'm not sure their extension is particularly reusable, though:
https://chromium.googlesource.com/chromium/tools/depot_tools.git/+/master/git_number.py

I think for LLVM, whatever process updates the umbrella repo should add the sequential IDs to the commit message, and that will help provide continuity across the git/svn bridge.
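The suggestion above can be sketched as follows: whatever process creates the umbrella commit appends a sequential-ID footer derived from the history count. This is only an illustration on a throwaway repo; the "Sequential-ID:" footer name, the amend-based mechanism, and the identities are assumptions, not an agreed format.

```shell
#!/bin/sh
# Sketch: append a sequential-ID footer to the newest umbrella commit,
# using "git rev-list --count" as the counter. Footer name and the
# amend-based approach are illustrative; a real hook would write the
# footer when the commit is first created.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email you@example.com
git config user.name "You"

echo hi > heads
git add heads
git commit -qm "update submodule heads"

n=$(git rev-list --count HEAD)      # sequential position in history
git commit --amend -qm "$(git log -1 --format=%B)

Sequential-ID: $n"

git log -1 --format=%B              # message now carries the footer
```

Since the count is derived purely from the history, any clone of the umbrella repo can recompute the same numbers, which is what provides continuity across the git/svn bridge.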
Quick re-cap.

After a few rounds, not only was the "external server" proposal obliterated as totally unnecessary, but the idea that we may even need a hook at all is now being challenged.

Jared's idea to use "git describe" is in line with previous proposals to use rev-list --count, and to do so only up to the previous tag, but all in one nice and standard little feature. There were concerns about applying one tag per commit, but most of them offered weak evidence. However, if "describe" can cover all our needs, there is no point in even discussing tags.

Just for reference, GitHub *does* have an SVN interface [1], and you can already check out a specific revision with "svn checkout -r NNN repo", which *is already* using "git rev-list --count". This means that, for SVN-based bisects, using GitHub will be *completely transparent* for SVN users. You can also base your releases off an SVN view of the Git repo.

So, to clear up this discussion and finish my proposal to move to GitHub, my final questions, only for those that *want* SVN compatibility:

1. Is there anything in the SVN view of GitHub that *doesn't* work for you? (ie. same as using "rev-list --count")

2. If so, can "git describe" solve the problem?

3. If not, please describe, in detail, why <<your alternative solution>> would be the *only* way forward.

I'll let this sit for a few days, and if no one raises any serious issue, I'll write up the final proposal and start the voting process with the Foundation.

cheers,
--renato

[1] https://github.com/blog/626-announcing-svn-support
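For those unfamiliar with the two mechanisms being compared above, here is a small self-contained demonstration on a throwaway repo. The tag name, identities and commit contents are made up for illustration.

```shell
#!/bin/sh
# Demonstrates the two candidate "sequential ID" mechanisms:
#   git rev-list --count  - monotonic revision number over the whole history
#   git describe          - commits since the most recent annotated tag
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q repo
cd repo
git config user.email you@example.com
git config user.name "You"

for i in 1 2 3; do
    echo $i > file$i
    git add file$i
    git commit -qm "commit $i"
done

# Pretend the first commit carried the previous release tag.
git tag -a -m "base" base "HEAD~2"

git rev-list --count HEAD   # prints 3: a whole-history sequential ID
git describe                # prints base-2-g<hash>: 2 commits past "base"
```

Both numbers are recomputable from any clone with no server involved, which is why they make the external ID service unnecessary.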
David Chisnall via llvm-dev
2016-Jul-05 11:13 UTC
[llvm-dev] [llvm-foundation] Sequential ID Git hook
On 5 Jul 2016, at 11:44, Renato Golin via llvm-foundation <llvm-foundation at lists.llvm.org> wrote:
> Just for reference, GitHub *does* have an SVN interface [1], and you
> can already checkout a specific revision with "svn checkout -r NNN
> repo", which *is already* using "git rev-list --count".
>
> This means that, for SVN based bisects, using GitHub will make it
> *completely transparent* for SVN users. You can also base your
> releases off an SVN view of the Git repo.

Note that GitHub (currently, at least) doesn’t export submodules sensibly with their svn version. I don’t intend to use the svn thing (the only time that I have used it in anger was to replace a project that moved to GitHub with an svn:external that referred to the GitHub repo, so people could easily find the new location), but that would cause problems if anyone wants to do an svn bisect.

I think it would help to have a description of how to bisect for a clang or lldb (or some other subproject) regression.

For downstream users, it would also be nice if tools like git-imerge let you merge clang and llvm together, though that’s a nice-to-have feature that we currently lack, so it shouldn’t in any way block the migration.

David
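As a starting point for the bisect description requested above, the core mechanics of an automated bisect look like this. The repo here is a toy stand-in for the umbrella project, and the "regression" is simulated by a file's contents; in a real LLVM setup the test script would update submodules and build/run a regression test.

```shell
#!/bin/sh
# Toy demonstration of "git bisect run": finds the first commit where an
# automated check starts failing. Repo layout and the check are made up.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email you@example.com
git config user.name "You"

# Five commits; the simulated regression appears at commit c3 (value >= 3).
for i in 1 2 3 4 5; do
    echo $i > value
    git add value
    git commit -qm "c$i"
done

git bisect start HEAD HEAD~4                    # bad = c5, known good = c1
git bisect run sh -c 'test "$(cat value)" -lt 3'
first_bad=$(git show -s --format=%s bisect/bad) # subject of first bad commit
echo "first bad commit: $first_bad"
git bisect reset
```

The script passed to "git bisect run" exits 0 for good revisions and non-zero for bad ones; for a subproject regression it would be the place to do "git submodule update" before building.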
> On Jul 5, 2016, at 3:44 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Quick re-cap.
>
> After a few rounds, not only the "external server" proposal got
> obliterated as totally unnecessary, but the idea that we may even need
> a hook at all is now challenged.

This is not clear to me. How is the umbrella repository updated?

— Mehdi