Now that we seem to be converging on an acceptable Git model, there is only one remaining doubt: how the trigger to update a sequential ID will work. I've been in contact with GitHub folks, and this is in line with their suggestions...

Given the nature of our project's repository structure, triggers in each repository can't just update their own sequential ID (like Gerrit) because we want a sequence that is ordered across the whole project, not just each component. But it's clear to me that we have to do something similar to Gerrit, as this has been proven to work on a larger infrastructure.

Adding an incremental "Change-ID" to the commit message should suffice, in the same way as we have SVN revisions now, if we can guarantee that:

1. The ID will be unique across *all* projects.
2. Earlier pushes will get lower IDs than later ones.

Other things are not important:

3. We don't need the ID space to be complete (ie. we can jump from 123 to 125 if some error happens).
4. We don't need an ID for every "commit", but for every push. A multi-commit push is a single feature, and doing so will help buildbots build the whole set as one change. Reverts should also be done in one go.

What's left for the near future:

5. We don't yet handle multi-repository patch-sets. A way to implement this is via manual Change-ID manipulation (explained below). Not hard, but not a priority.

Design decisions

This could be a pre/post-commit trigger on each repository that receives an ID from somewhere (TBD) and updates the commit message. When the umbrella project synchronises, it'll already have the sequential number in. In this case, the umbrella project is not necessary for anything other than bisect, buildbots and releases.

I personally believe that having the trigger in the umbrella project will be harder to implement and more error prone.

The server has to have some kind of locking mechanism.
Web services normally spawn dozens of "listeners", meaning multiple pushes won't fail to get a response, since the lock will be further down, after the web server. Therefore, the lock for the unique increment ID has to be elsewhere. The easiest thing I can think of is a SQL database with an auto-increment ID. Example:

Initially:

    sql> create table LLVM_ID (
           id int not null primary key auto_increment,
           repository varchar not null,
           hash varchar not null );
    sql> alter table LLVM_ID auto_increment = 300000;

On every request:

    sql> insert into LLVM_ID (repository, hash) values ("$repo_name", "$hash");
    sql> select last_insert_id(); -> return

and then print the "last insert id" back to the user in the body of the page, so the hook can update the Change-ID in the commit message. The repo/hash info is more for logging, debugging and conflict resolution purposes.

We also must limit the web server to only accept connections from GitHub's servers, to avoid abuse. Other repos on GitHub could still abuse it, and we can go further if it becomes a problem, but given point (3) above, we may fix that only if it does happen.

This solution doesn't scale to multiple servers, nor does it help with BCP planning. Given the size of our needs, that's not relevant.

Problems

If the server goes down, given point (3), we may not be able to reproduce locally the same sequence as the server would. Meaning SVN-based bisects and releases would not be possible during down times. But Git bisect and everything else would.

Furthermore, even if a local script can't reproduce exactly what the server would do, it can still make the history linear for bisect purposes, fixing the local problem. I can't see a situation in which we need the sequence for any other purpose.

Upstream and downstream releases can easily wait a day or two in the unlucky situation that the server goes down at the exact time the release will be branched.
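To illustrate the locking requirement, the critical section of such an ID service could be sketched without SQL at all, using just a lock file and a counter. This is only a sketch under stated assumptions: the paths, the `flock(1)` utility (from util-linux) and the 300000 starting value are illustrative stand-ins for the SQL auto_increment above, not an agreed design.

```shell
#!/bin/sh
# Minimal sketch of the unique-increment-ID service core. flock(1)
# serialises concurrent requests, standing in for the database's
# auto_increment lock; paths and the starting value are illustrative.
COUNTER=/tmp/llvm_push_id
LOCK=/tmp/llvm_push_id.lock

next_push_id() {
  (
    flock 9                                     # block until we hold the lock
    id=$(cat "$COUNTER" 2>/dev/null || echo 299999)
    id=$((id + 1))                              # allocate the next ID
    echo "$id" > "$COUNTER"
    echo "$id"                                  # what the hook would receive
  ) 9>"$LOCK"
}

rm -f "$COUNTER"
next_push_id   # first request
next_push_id   # second request
```

Because the lock is held across read-increment-write, two simultaneous pushes can never be handed the same ID, which is the whole point of moving the lock out of the web-server layer.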
Migrations and backups also work well, and if we use some cloud server, we can easily take snapshots every week or so, migrate images across the world, etc. We don't need duplication, read-only scaling, multi-master, etc., since only the web service will be writing to and reading from it.

All in all, a "robust enough" solution for our needs.

Bundle commits

Just FYI, here's a proposal that appeared in the "commit message format" round of emails a few months ago, which can work well for bundling commits together, but will need more complicated SQL handling.

The current proposal is to have one ID per push. This is easy using auto_increment. But if we want to have one ID for multiple pushes, on different repositories, we'll need to have the same ID on two or more "repo/hash" pairs.

At the commit level, the developer adds a temporary hash, possibly generated by a local script in 'utils'. Example:

    Commit-ID: 68bd83f69b0609942a0c7dc409fd3428

This ID will have to be the same on both (say) the LLVM and Clang commits. The server script will then take that hash, generate an ID, and if it receives two or more pushes with such hashes, it'll return the *same* ID, say 123456, in which case the Git hooks on all projects will update the commit message by replacing the original Commit-ID with:

    Commit-ID: 123456

To avoid hash clashes in the future, the server script can refuse existing hashes that are more than a few hours old and return an error, in which case the developer generates a new hash, updates all commit messages and re-pushes.

If there is no Commit-ID, or if it's empty, we just insert a new row, get the auto-increment ID and return. Meaning, empty Commit-IDs won't "match" any other.

To solve this on the server side, a few ways are possible:

A. We stop using primary_key auto_increment, handle the increment in the script and use SQL transactions.

This would be feasible, but more complex and error prone.
I suggest we go down that route only if keeping the repo/hash information is really important.

B. We ditch keeping a record of repo/hash and just re-use the ID, but record the original string, so we can match it later.

This keeps it simple and will work for our purposes, but we'll lose the ability to debug problems if they happen in the future.

C. We improve the SQL design to have two tables:

    LLVM_ID:
     * ID: int PK auto
     * Key: varchar null

    LLVM_PUSH:
     * LLVM_ID: int FK (LLVM_ID:ID)
     * Repo: varchar not null
     * Push: varchar not null

Every new push updates both tables and returns the ID. Pushes with the same Key re-use the ID, update only LLVM_PUSH, and return the same ID.

This is slightly more complicated, and we'll need to code scripts to gather information (for logging and debugging), but it gives us both benefits (debug + auto_increment) in one package. As a start, I'd recommend we take this route even before the client script supports it; it could be simple enough that we add support for it right from the beginning.

I vote for option C.

Deployment

I recommend we code this, set up a server, and let it run for a while on our current mirrors *before* we do the move. A simple plan is to:

* Develop the server and hooks, and set them running without updating the commit message.
* Follow the logs and make sure everything is sane.
* Change the hook to start updating the commit message.
* Follow the commit messages, and move some buildbots to track GitHub (SVN still master).
* When all bots are live tracking GitHub and all developers have moved, we flip.

Sounds good?

cheers,
--renato
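The temporary Commit-ID described in the "Bundle commits" section above could come from a small local helper along these lines. This is hypothetical: no such script exists in 'utils' yet, and the function name and byte count are made up for illustration to match the 32-hex-character example in the proposal.

```shell
#!/bin/sh
# Hypothetical local helper for the proposed 'utils' script: emits a
# 32-hex-character temporary Commit-ID (like the example
# 68bd83f69b0609942a0c7dc409fd3428 above), to be pasted into each
# commit message of a multi-repository patch-set before pushing.
gen_commit_id() {
  # 16 random bytes, hex-encoded with spaces and newlines stripped
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

id=$(gen_commit_id)
echo "Commit-ID: $id"
```

The developer would run this once per patch-set and use the same value in the LLVM and Clang commit messages, which is what lets the server match them up.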
On 6/30/2016 7:43 AM, Renato Golin via llvm-dev wrote:
> Given the nature of our project's repository structure, triggers in
> each repository can't just update their own sequential ID (like
> Gerrit) because we want a sequence in order for the whole project, not
> just each component. But it's clear to me that we have to do something
> similar to Gerrit, as this has been proven to work on a larger
> infrastructure.

I'm assuming that pushes to submodules will result in a (nearly) immediate commit/push to the umbrella repo to update it with the new submodule head. Otherwise, checking out the umbrella repo won't get you the latest submodule updates. Since updates to the umbrella project are needed to synchronize it with updates to sub-modules, it seems to me that if you want an ID that applies to all projects, it would have to be coordinated relative to the umbrella project.

> Design decisions
>
> This could be a pre/post-commit trigger on each repository that
> receives an ID from somewhere (TBD) and updates the commit message.
> When the umbrella project synchronises, it'll already have the
> sequential number in. In this case, the umbrella project is not
> necessary for anything other than bisect, buildbots and releases.

I recommend using git tag rather than updating the commit message itself. Tags are more versatile.

> I personally believe that having the trigger in the umbrella project
> will be harder to implement and more error prone.

Relative to a SQL database and a server, I think managing the ID from the umbrella repository would be much simpler and more reliable. Managing IDs from a repo using git metadata is pretty simple. Here's an example script that creates a repo and allocates a push tag in conjunction with a sequence of commits (here I'm simulating pushes of individual commits rather than using git hooks, for simplicity). I'm not a git expert, so there may be better ways of doing this, but I don't know of any problems with this approach.
    #!/bin/sh
    rm -rf repo

    # Create a repo
    mkdir repo
    cd repo
    git init

    # Create a well-known object.
    PUSH_OBJ=$(echo "push ID" | git hash-object -w --stdin)
    echo "PUSH_OBJ: $PUSH_OBJ"

    # Initialize the push ID to 0.
    git notes add -m 0 $PUSH_OBJ

    # Simulate some commits and pushes.
    for i in 1 2 3; do
        echo $i > file$i
        git add file$i
        git commit -m "Added file$i" file$i
        PUSH_TAG=$(git notes show $PUSH_OBJ)
        PUSH_TAG=$((PUSH_TAG+1))
        git notes add -f -m $PUSH_TAG $PUSH_OBJ
        git tag -m "push-$PUSH_TAG" push-$PUSH_TAG
    done

    # List commits with push tags.
    git log --decorate=full

Running the above shows a git log with the tags:

    commit a4ca4a0b54d5fb61a2dacbab5732d00cf8216029 (HEAD, tag: refs/tags/push-3, refs/heads/master)
    ...
        Added file3

    commit e98e2669569d5cfb15bf4cd1f268507873bcd63f (tag: refs/tags/push-2)
    ...
        Added file2

    commit 5c7f29107838b4af91fe6fa5c2fc5e3769b87bef (tag: refs/tags/push-1)
    ...
        Added file1

The above script is not transaction safe because it runs commands individually. In a real deployment, git hooks would be used and would rely on push locks to synchronize updates. Those hooks could also distribute ID updates to the submodules to keep them synchronized.

Tom.
I don't think we should do any of that. It's too complicated -- and I don't see the reason to even do it.

There's a need for the "llvm-project" repository -- that's been discussed plenty -- but where does the need for a separate "id" that must be pushed into all of the sub-projects come from? This is the first I've heard of that as a thing that needs to be done.

There was a previous discussion about putting a sequential ID in the "llvm-project" repo commit messages (although even that I'd say is unnecessary), but not anywhere else.
Reid Kleckner via llvm-dev
2016-Jun-30 16:33 UTC
[llvm-dev] [lldb-dev] Sequential ID Git hook
Agreed, the llvm-project repository can completely take on the role of the SQL database in Renato's proposal.

Chromium created a "git-number" extension that assigns sequential IDs to commits in the obvious way, and that provided some continuity with the "git-svn-id:" footers in commit messages. I'm not sure their extension is particularly reusable, though:
https://chromium.googlesource.com/chromium/tools/depot_tools.git/+/master/git_number.py

I think for LLVM, whatever process updates the umbrella repo should add the sequential IDs to the commit message, and that will help provide continuity across the git/svn bridge.
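The suggestion above can be sketched as follows: whatever process creates the umbrella commit appends a sequential-ID footer derived from the history count. This is only an illustration on a throwaway repo; the "Sequential-ID:" footer name, the amend-based mechanism, and the identities are assumptions, not an agreed format.

```shell
#!/bin/sh
# Sketch: append a sequential-ID footer to the newest umbrella commit,
# using "git rev-list --count" as the counter. Footer name and the
# amend-based approach are illustrative; a real hook would write the
# footer when the commit is first created.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email you@example.com
git config user.name "You"

echo hi > heads
git add heads
git commit -qm "update submodule heads"

n=$(git rev-list --count HEAD)      # sequential position in history
git commit --amend -qm "$(git log -1 --format=%B)

Sequential-ID: $n"

git log -1 --format=%B              # message now carries the footer
```

Since the count is derived purely from the history, any clone of the umbrella repo can recompute the same numbers, which is what provides continuity across the git/svn bridge.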
Quick re-cap.

After a few rounds, not only was the "external server" proposal obliterated as totally unnecessary, but the idea that we may even need a hook at all is now being challenged.

Jared's idea to use "git describe" is in line with previous proposals to use rev-list --count, and to do so only up to the previous tag, but all in one nice and standard little feature. There were concerns about applying one tag per commit, but most of them offered weak evidence. However, if "describe" can cover all our needs, there is no point in even discussing tags.

Just for reference, GitHub *does* have an SVN interface [1], and you can already check out a specific revision with "svn checkout -r NNN repo", which *is already* using "git rev-list --count". This means that, for SVN-based bisects, using GitHub will be *completely transparent* for SVN users. You can also base your releases off an SVN view of the Git repo.

So, to clear up this discussion and finish my proposal to move to GitHub, my final questions, only for those that *want* SVN compatibility:

1. Is there anything in the SVN view of GitHub that *doesn't* work for you? (ie. same as using "rev-list --count")

2. If so, can "git describe" solve the problem?

3. If not, please describe, in detail, why <<your alternative solution>> would be the *only* way forward.

I'll let this sit for a few days, and if no one raises any serious issue, I'll write up the final proposal and start the voting process with the Foundation.

cheers,
--renato

[1] https://github.com/blog/626-announcing-svn-support
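For those unfamiliar with the two mechanisms being compared above, here is a small self-contained demonstration on a throwaway repo. The tag name, identities and commit contents are made up for illustration.

```shell
#!/bin/sh
# Demonstrates the two candidate "sequential ID" mechanisms:
#   git rev-list --count  - monotonic revision number over the whole history
#   git describe          - commits since the most recent annotated tag
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q repo
cd repo
git config user.email you@example.com
git config user.name "You"

for i in 1 2 3; do
    echo $i > file$i
    git add file$i
    git commit -qm "commit $i"
done

# Pretend the first commit carried the previous release tag.
git tag -a -m "base" base "HEAD~2"

git rev-list --count HEAD   # prints 3: a whole-history sequential ID
git describe                # prints base-2-g<hash>: 2 commits past "base"
```

Both numbers are recomputable from any clone with no server involved, which is why they make the external ID service unnecessary.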
David Chisnall via llvm-dev
2016-Jul-05 11:13 UTC
[llvm-dev] [llvm-foundation] Sequential ID Git hook
On 5 Jul 2016, at 11:44, Renato Golin via llvm-foundation <llvm-foundation at lists.llvm.org> wrote:
> Just for reference, GitHub *does* have an SVN interface [1], and you
> can already checkout a specific revision with "svn checkout -r NNN
> repo", which *is already* using "git rev-list --count".
>
> This means that, for SVN based bisects, using GitHub will make it
> *completely transparent* for SVN users. You can also base your
> releases off an SVN view of the Git repo.

Note that GitHub (currently, at least) doesn’t export submodules sensibly with their svn version. I don’t intend to use the svn thing (the only time that I have used it in anger was to replace a project that moved to GitHub with an svn:external that referred to the GitHub repo, so people could easily find the new location), but that would cause problems if anyone wants to do an svn bisect.

I think it would help to have a description of how to bisect for a clang or lldb (or some other subproject) regression.

For downstream users, it would also be nice if tools like git-imerge let you merge clang and llvm together, though that’s a nice-to-have feature that we currently lack, so it shouldn’t in any way block the migration.

David
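As a starting point for the bisect description requested above, the core mechanics of an automated bisect look like this. The repo here is a toy stand-in for the umbrella project, and the "regression" is simulated by a file's contents; in a real LLVM setup the test script would update submodules and build/run a regression test.

```shell
#!/bin/sh
# Toy demonstration of "git bisect run": finds the first commit where an
# automated check starts failing. Repo layout and the check are made up.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email you@example.com
git config user.name "You"

# Five commits; the simulated regression appears at commit c3 (value >= 3).
for i in 1 2 3 4 5; do
    echo $i > value
    git add value
    git commit -qm "c$i"
done

git bisect start HEAD HEAD~4                    # bad = c5, known good = c1
git bisect run sh -c 'test "$(cat value)" -lt 3'
first_bad=$(git show -s --format=%s bisect/bad) # subject of first bad commit
echo "first bad commit: $first_bad"
git bisect reset
```

The script passed to "git bisect run" exits 0 for good revisions and non-zero for bad ones; for a subproject regression it would be the place to do "git submodule update" before building.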
> On Jul 5, 2016, at 3:44 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Quick re-cap.
>
> After a few rounds, not only the "external server" proposal got
> obliterated as totally unnecessary, but the idea that we may even need
> a hook at all is now challenged.

This is not clear to me. How is the umbrella repository updated?

— Mehdi