Hi,

On Tues Jul 17th, we had a full GitLab outage from 14:00 to 18:00 UTC, whilst attempting to upgrade the underlying storage. This was a semi-planned outage, which we'd hoped would last for approximately 30min.

During the outage, the GitLab web UI and API, as well as HTTPS git clones through https://gitlab.freedesktop.org, were completely unavailable, giving connection timeout errors. anongit and cgit remained completely functional. There was no data loss.

The outage was 'semi-planned' in that it was only announced a couple of hours in advance. It was also scheduled at one of the worst possible times: whilst all of Europe and the American east coast are active, the west coast is beginning to come online, and some of Asia is still online.

Most of our outages happen early in the European morning, when we see the lightest usage (only eastern Europe and Asia online), and usually last only around five minutes.


Background
----------------

gitlab.freedesktop.org runs on the Google Cloud Platform, using the Google Kubernetes Engine and Helm charts[0]. The cluster currently runs Kubernetes 1.10.x. The service itself runs in a single Kubernetes Pod, using the latest published GitLab CE image from gitlab.org (11.1.2 at the time of writing; 11.0.4 at the time of the outage).

Some GitLab data is stored in Google Cloud Storage buckets, including CI job artifacts and traces, file uploads, Git LFS data, and backups. The Git repositories themselves are stored inside a master data partition which is exposed to the container as a local filesystem through a Kubernetes PersistentVolumeClaim; other data is stored in PostgreSQL, which is again backed by a Kubernetes PVC.

Repository storage is (currently) just regular Git repositories. Forking a repository simply calls 'git clone' with no sharing of objects; storage is not deduplicated across forks.

Kubernetes persistent volumes are not currently resizeable. There is alpha support in 1.10.x, scheduled to become generally available shortly.

Backups are executed from a daily cronjob, set to run at 5am UTC: this executes a Rails Rake task inside the master GitLab pod. The backups cover all data _except_ that which is already stored in GCS. For legacy reasons, backups are made by first capturing all the data to be backed up into a local directory; an uncompressed tarball is then created in the same local directory and uploaded to storage. This means that the directory used for backups must have a bit over twice the size of the final backup available as free space.


Events
------------

Shortly after 9am UTC, it became clear that the disk space situation on our GitLab cluster was critical. Some API requests were failing due to a shortage of disk space. A quick investigation showed this was due to a large number of recently-created forks of Mesa in particular, each of which requires ~2.1GB of storage.
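For the record, surfacing this doesn't need anything more exotic than df and du inside the GitLab pod. The paths below assume the default Omnibus layout and the pod name is a placeholder, so treat this as a sketch of the kind of check involved rather than the exact commands run:

    # overall usage of the repository volume
    kubectl exec <gitlab-pod> -- df -h /var/opt/gitlab/git-data

    # which namespaces are eating the space
    kubectl exec <gitlab-pod> -- sh -c \
        'du -sh /var/opt/gitlab/git-data/repositories/* | sort -h | tail -n 20'

Fork-heavy growth of the kind described above shows up immediately at the bottom of that listing.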
The backup cron job had also started failing for the same reason. This made resolution quite critical: not only did we not have backups, but we were only one new Mesa or kernel fork away from completely exhausting our disk space, and potentially exposing users to data loss: e.g. filing a new issue and having it fail, or being unable to push to their repositories.

Before 10am UTC, it was announced that we would need urgent downtime later in the day, in order to expand our disk volumes. At this point, I spent a good deal of time researching Kubernetes persistent-volume resizing (something I'd previously looked into, for when this situation arose), and doing trial migrations on a scratch Kubernetes cluster.

At 1pm UTC, I announced an outage window in order to do this, from 2-2:30pm UTC.

At 2pm UTC, I announced the window had begun, and started the yak-shave.

Firstly, I modified the firewall rules to drop all inbound traffic to GitLab other than that from the IP I was working from, so that others would not see transient failures, just a connection timeout. This also ensured backup integrity: we would be able to snapshot the data at a given point without worrying about losing changes made after that point.

I took a manual backup of all the GitLab data (minus what was on GCS): this consisted of letting the usual backup Rake task run to the point of collecting all the data, but stopping it before running tar, as running tar would've exceeded the available disk space and killed the cluster. Instead, I ran 'tar' with its output streamed over SSH to an encrypted partition on a local secure machine.

Secondly, I took snapshots of all the data disks. Each Kubernetes PersistentVolume is backed by a Google Compute Engine disk, which can be manually snapshotted and tagged.

Both of these steps took much longer than planned. The backup task was taking much longer than it had historically - in hindsight, it should've been clear that, with one of our problems being a huge increase in backup size, both generating and copying the backups would take far longer than they previously had. At this point, I announced an extension of the outage window, to 2-3pm UTC.

Snapshotting the disks also took far longer than hoped. I was working through Google's Cloud Console web UI, which can be infuriatingly slow: after a (not overly quick) page load, there is a good few seconds' wait whilst the data loads asynchronously and then populates the page content. Working through the disks to determine which was which, snapshotting each one, and tagging those snapshots took some time. This was compounded by my not being familiar with the disk snapshot UI, and by an abundance of caution: I checked multiple times that we did in fact have snapshots of all the relevant volumes.

After this was done, I upgraded the Kubernetes cluster to 1.10.x and attempted to resize the persistent volumes, which immediately failed. It became clear at this point that I had missed two crucial details. Firstly, it was not possible to make a static-sized disk resizeable on the fly: it would require destroying and then recreating the PersistentVolumes, then restoring the data onto those, either by restoring a backup image or by simply copying the old content to the new volumes. Secondly, it became clear that Google Kubernetes Engine did _not_ in fact provide support for resizing disks, as it was an alpha feature which was not possible to enable.

At this point I made sure the old persistent volumes would, in fact, persist after they had been orphaned. This gave us three copies of the data: the local backup, the GCE disk snapshots, and the retained GCE disks themselves. I then spent some time figuring out how to pause service availability, so we could make the new disks available to the cluster without actually starting the services with a clean slate. This took a surprising amount of time, and was somewhat fiddly.
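For the record, both of those last two steps boil down to small kubectl operations along these lines; the resource names here are placeholders rather than our actual ones:

    # make sure an orphaned PersistentVolume outlives its claim
    kubectl patch pv <repositories-pv> \
        -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

    # pause the service without throwing away its configuration
    kubectl scale deployment <gitlab-deployment> --replicas=0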
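Similarly, the manual backup and snapshot steps above reduce to something like the following sketch; the paths assume the Omnibus layout, and the pod, host, disk, and zone names are all illustrative:

    # stream the collected backup data off the pod rather than tarring it
    # in place (which would have needed twice the free space we didn't have)
    kubectl exec <gitlab-pod> -- tar -C /var/opt/gitlab/backups -cf - . \
        | ssh backup-host 'cat > /mnt/encrypted/gitlab-backup.tar'

    # snapshot the GCE disk backing each PersistentVolume
    gcloud compute disks snapshot <pd-name> \
        --snapshot-names=<pd-name>-pre-resize --zone=<zone>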
Whilst working through this, I also ran into a new failure mode in how we run Helm: if some resources were unavailable (due to a typo), it would block indefinitely waiting for them to become available (which would never happen), rather than failing immediately.

Long before this, I had started copying the backups back towards the cluster, so that we had the option of restoring from backup if that turned out to be a good idea. However, at this point I started having serious degradation of my network connection: not only did my upload speed vary wildly (by a factor of 100), but due to local issues the workstation I was using spent some time refusing to route HTTPS traffic to the Kubernetes control API. Much time was spent debugging and resolving this.

The preferred option was to restore the previous snapshots into new disks: this meant we did not have to block on the backup upload, and could be sure that we had exactly the same content as previously. I started pursuing this option: once I had ensured that the new Kubernetes persistent volumes had created new GCE disks, I attempted to restore the disk snapshots into them.

At this point, I discovered the difference between GCE disks, disk images, and disk snapshots. It is not possible to directly restore a snapshot into a live disk: you must attach both the target disk and a disk created from the source snapshot to a new GCE VM, boot the VM, and copy between the two. I did this with new GCE disks, and attempted to use those disks as backing storage for new Kubernetes PersistentVolumes that we could reuse directly. More time was lost due to the Helm failures above. Eventually, when we got there, I discovered that creating a new Kubernetes PV/PVC from a GCE disk will obliterate all the content on that disk, so that avenue was useless.

Quite some hours into the outage, I decided to take a fifteen-minute break, go for a walk outside, and try to reason about what was going on and what we should do next.

Coming back, I pursued the last good option: stop the Kubernetes services completely, attach both the new enlarged PV disks and disks restored from the old snapshots to a new ephemeral GCE VM, copy directly between them, stop the GCE VM, and restart Kubernetes. This mostly succeeded, except for subvolumes: Kubernetes exposes '$disk/mysql/' as the root mountpoint for the MySQL data volume, whereas mounting the raw disk exposes '$disk' as the root mountpoint. The copy didn't correctly preserve the subdirectory, so though we had the Git data accessible, MySQL was seeing an empty database.

To avoid any desynchronisation, I destroyed all the resources again, created completely new and empty volumes, created a new GCE VM, and re-did the copy with the correct directory structure. Coming back up, I manually verified that the list of repositories and users was the same as previously, and worked through parts of the UI (e.g. could I still search for issues after not restoring the Redis cache?) and a few typical workflows. I had also started a full backup task in the background, to ensure that our backup cron job would succeed in the morning without needing another outage.

Once this was done, around 18:45 UTC, I restored public access and announced the end of the outage.

A couple of days later, I spent some time cleaning up all the ephemeral resources created during the outage (persistent volumes, disks, disk snapshots, VMs, etc).
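For the record, the procedure that eventually worked is roughly the sketch below. Disk, VM, and device names are placeholders, and the real invocations differed in detail, but it captures the shape of it:

    # materialise a snapshot as a new disk (snapshots can't be attached directly)
    gcloud compute disks create restored-repos \
        --source-snapshot=<snapshot-name> --zone=<zone>

    # attach both the restored disk and the new, larger PV disk to a scratch VM
    gcloud compute instances create copy-vm --zone=<zone>
    gcloud compute instances attach-disk copy-vm \
        --disk=restored-repos --device-name=old --zone=<zone>
    gcloud compute instances attach-disk copy-vm \
        --disk=<new-pv-disk> --device-name=new --zone=<zone>

    # then, on the VM: mount both and copy the whole tree, so that
    # subdirectories end up exactly where Kubernetes expects to find them
    sudo mkdir -p /mnt/old /mnt/new
    sudo mount /dev/disk/by-id/google-old /mnt/old
    sudo mount /dev/disk/by-id/google-new /mnt/new
    sudo cp -a /mnt/old/. /mnt/new/

Getting that last copy right, i.e. preserving the subdirectory layout, is exactly the detail that forced the second pass.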

What went badly
-----------------------

Many things.

The first we realised something was wrong was when people mentioned the failures on IRC. Setting up a system (probably based on Prometheus/Grafana, as it is recommended by upstream, integrates well with the services, and has a good feature set) to capture key metrics like disk usage and API error rate, and to alert via email/IRC when these hit error thresholds, is a high-priority but also time-consuming task. Doing it myself requires learning a fair few new things, and also downtime (see below) whilst deploying it. So far I have not had a large enough block of time.

There is also a single point of failure: I am the only administrator who works on GitLab. Though Tollef and Keith have access to the Google organisation and could do so, they don't have the time. If I were not available, they would have to go through the process of setting up their accounts to control Kubernetes and familiarising themselves with our deployment. This is obviously bad, especially as I am relatively new to administering Kubernetes (as seen from the failures in the timeline).

The length of the backup task completely blew out our outage window. It should've been obvious that backups would take longer than they had previously; even if not, we could've run a test task to measure this before we announced an outage window which could never have been met.

My internet connection choosing that exact afternoon to be extremely unreliable was quite unhelpful. If any of this had been planned, I would've been somewhere with a much faster and more stable connection, but unfortunately I didn't have that choice.

Though I'd tested some of these changes throughout the morning, I hadn't tested the exact changes I ended up making. I'm not sure how it is possible to test some of them (e.g. how do I, with a scratch cluster, test that persistent volumes created with a several-versions-old Kubernetes will upgrade ... ?), but certainly I could've at least made the tests a little more thorough and realistic, particularly for things like the disk snapshots.


What went well
---------------------

No data loss: we had backups, and lots of them (at least three at every point from when destructive operations began). There was no danger at any point of data being lost for good, though some of those copies would have required unacceptably long downtime if used as the source for a restore.

Data being on GCS: having much of our data in cloud storage long delayed the point at which we needed to expand our local storage at all.


What we can do in future
----------------------------------

Monitoring, logging, and alerts: https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/8 is the task I filed long ago to get this set up. If anyone reading this is familiar with the landscape and can help, I would be immensely grateful for it.

Better communication: due to its last-minute nature, the outage was only announced on IRC and not to the lists; I also didn't inform the lists when the outage was dragging on, as I was so focused on what I needed to do to bring it back up. This was obviously a bad idea. We should look into how we can communicate these kinds of things a lot better.

Cloud Console web UI: the web UI is borderline unusable for interactive operation, due to long load times between pages and the amount of inter-page navigation required to achieve anything. It would be better to get more familiar with driving the command-line clients in anger, and use those where possible.
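For example, the disk-identification step that was so painful in the web UI is quick from the command line; something like this (the column path assumes the GCE PD provisioner) maps each PersistentVolume to the GCE disk backing it:

    # which GCE disk backs which PersistentVolume
    kubectl get pv -o custom-columns=NAME:.metadata.name,DISK:.spec.gcePersistentDisk.pdName

    # then inspect or snapshot them without touching the Cloud Console
    gcloud compute disks list
    gcloud compute snapshots list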
More admins: having myself as the single GitLab admin is not in any way sustainable, and we need to enlarge our group of admins. Having someone actually familiar with Kubernetes deployments in particular would be a massive advantage. I'm learning on the spot from online documentation, which, given the speed at which Kubernetes development moves, is often either useless or misleading.

Move away from the Omnibus container: currently, as mentioned above, every service behind gitlab.fd.o is deployed into a single Kubernetes pod, with PostgreSQL and Redis in linked containers. The GitLab container is called the 'omnibus' container, combining the web/API hosts, background job processors, Git repository service, SSH access, Pages server, etc. The container is a huge download, and on start it runs Chef, which takes approximately 2-3 minutes to configure the filesystem before even thinking about starting GitLab. The total minimum downtime for every change is therefore 4-5 minutes, and every change makes the whole of GitLab unavailable: this makes us really reluctant to change configuration unless necessary, giving us less experience with Kubernetes than we might otherwise have. GitLab upstream is working on a 'cloud native' deployment, splitting each service into its own container, each of which does _not_ spend minutes running Omnibus at startup and can be independently reconfigured without impacting the other services. Actually making this move will require multiple hours of downtime, which will need to be communicated long in advance.

Move storage to resizeable volumes or GCS: the next time we exhaust our disk space, we're going to need to go through all of this again. Moving more of our storage to cloud storage, where we can, means that we don't have to worry about it. When Kubernetes 1.11.x becomes available through GCP, we can also recreate the disks as resizeable volumes, which can be grown on demand, avoiding this whole process.
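Once that support lands, growing a volume should be a matter of a couple of kubectl operations rather than a migration; roughly (the storage class and claim names, and the new size, are illustrative):

    # allow expansion on the storage class (one-off)
    kubectl patch storageclass standard \
        -p '{"allowVolumeExpansion": true}'

    # then growing a volume is just a size bump on the claim; the pod may
    # still need a restart for the filesystem itself to be grown
    kubectl patch pvc <repositories-pvc> \
        -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'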
At least we've learned quite a bit from doing it this time?

Cheers,
Daniel

[0]: The Helm charts and configuration are available at https://gitlab.freedesktop.org/freedesktop/ with the exception of a repository containing various secrets (SSH keys, API keys, passwords, etc) which are grafted on to the existing chart.


We can also save space by using the main repo for private branches, e.g. my branches would be in refs/heads/mareko/*.

Marek
Marek Olšák <maraeo at gmail.com> writes:

> We can also save space by using the main repo for private branches,
> e.g. my branches would be in refs/heads/mareko/*.

It sounds like gitlab is not going to have this fixed very soon: https://gitlab.com/gitlab-org/gitlab-ce/issues/23029

I think this is an interesting idea. It would increase the exposure of Mesa developers to what each other is doing, without everyone needing to go star each other's repos. I think Mesa has been a little too cautious with branches in the main repo -- we very rarely push branches, and we're definitely bad at deleting those old branch heads once they're no longer relevant as branches for development. On the other hand, I'm not sure everyone wants to see every weird unfinished branch I have, and a personal repo is nice for hanging on to those.

Once we have gitlab CI hooked up (hopefully in the next week or so), I want developers to be able to push to gitlab and have CI go through the patch before they actually submit to the ML. If we're pushing to branches on origin, that could potentially be a lot of mailing list spam. I'd want some sort of solution for that before we start telling people to just put their stuff in user branches on the central repo instead of personal forks.

People can also be a lot more cavalier with 'git push personalrepo +branch' than I'd like them to be with 'git push origin +branch'.
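For concreteness, the day-to-day flow Marek is describing would look something like this (the branch names are only examples):

    # push a work-in-progress branch into a per-user namespace on the main repo
    git push origin my-feature:refs/heads/mareko/my-feature

    # and clean it up once it's no longer needed
    git push origin --delete mareko/my-feature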