Hi,

On Tues Jul 17th, we had a full GitLab outage from 14:00 to 18:00 UTC, whilst attempting to upgrade the underlying storage. This was a semi-planned outage, which we'd hoped would last for approximately 30min.

During the outage, the GitLab web UI and API, as well as HTTPS git clones through https://gitlab.freedesktop.org, were completely unavailable, giving connection timeout errors. anongit and cgit remained completely functional. There was no data loss.

The outage was 'semi-planned' in that it was only announced a couple of hours in advance. It was also scheduled at one of the worst possible times: whilst all of Europe and the American east coast are active, the west coast is beginning to come online, and some of Asia is still online.

Most of our outages happen early in the European morning, when we see the lightest usage (only eastern Europe and Asia online), and usually last only around five minutes.


Background
----------------

gitlab.freedesktop.org runs on the Google Cloud Platform, using the Google Kubernetes Engine and Helm charts[0]. The cluster currently runs Kubernetes 1.10.x. The service itself runs in a single Kubernetes Pod, using the latest published GitLab CE image from gitlab.org (11.1.2 at the time of writing; 11.0.4 at the time of the outage).

Some GitLab data is stored in Google Cloud Storage buckets, including CI job artifacts and traces, file uploads, Git LFS data, and backups. The Git repositories themselves are stored inside a master data partition which is exposed to the container as a local filesystem through a Kubernetes PersistentVolumeClaim; other data is stored in PostgreSQL, which is again backed by a Kubernetes PVC.

Repository storage is (currently) just regular Git repositories. Forking a repository simply calls 'git clone' with no sharing of objects; storage is not deduplicated across forks.

Kubernetes persistent volumes are not currently resizeable. There is alpha support in 1.10.x, scheduled to become generally available shortly.

Backups are executed from a daily cronjob, set to run at 5am UTC: this executes a Rails Rake task inside the master GitLab pod. The backups cover all data _except_ that which is already stored in GCS. For legacy reasons, backups are made by first capturing all the data to be backed up into a local directory; an uncompressed tarball is then created in the same local directory and uploaded to storage. This means that the directory used for backups must have a bit over twice the size of the final backup available as free space.


Events
------------

Shortly after 9am UTC, it became clear that the disk space situation on our GitLab cluster was critical. Some API requests were failing due to a shortage of disk space. A quick investigation showed this was due to a large number of recently-created forks of Mesa in particular, each of which requires ~2.1GB of storage.
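For the record, surfacing this doesn't need anything more exotic than df and du inside the GitLab pod. The paths below assume the default Omnibus layout and the pod name is a placeholder, so treat this as a sketch of the kind of check involved rather than the exact commands run:

    # overall usage of the repository volume
    kubectl exec <gitlab-pod> -- df -h /var/opt/gitlab/git-data

    # which namespaces are eating the space
    kubectl exec <gitlab-pod> -- sh -c \
        'du -sh /var/opt/gitlab/git-data/repositories/* | sort -h | tail -n 20'

Fork-heavy growth of the kind described above shows up immediately at the bottom of that listing.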
The backup cron job had also started failing for the same reason. This made resolution quite critical: not only did we not have backups, but we were only one new Mesa or kernel fork away from completely exhausting our disk space, and potentially exposing users to data loss: e.g. filing a new issue and having it fail, or being unable to push to their repositories.

Before 10am UTC, it was announced that we would need urgent downtime later in the day, in order to expand our disk volumes. At this point, I spent a good deal of time researching Kubernetes persistent-volume resizing (something I'd previously looked into, for when this situation arose), and doing trial migrations on a scratch Kubernetes cluster.

At 1pm UTC, I announced an outage window in order to do this, from 2-2:30pm UTC.

At 2pm UTC, I announced the window had begun, and started the yak-shave.

Firstly, I modified the firewall rules to drop all inbound traffic to GitLab other than that from the IP I was working from, so that others would not see transient failures, just a connection timeout. This also ensured backup integrity: we would be able to snapshot the data at a given point without worrying about losing changes made after that point.

I took a manual backup of all the GitLab data (minus what was on GCS): this consisted of letting the usual backup Rake task run to the point of collecting all the data, but stopping it before running tar, as running tar would've exceeded the available disk space and killed the cluster. Instead, I ran 'tar' with its output streamed over SSH to an encrypted partition on a local secure machine.

Secondly, I took snapshots of all the data disks. Each Kubernetes PersistentVolume is backed by a Google Compute Engine disk, which can be manually snapshotted and tagged.

Both of these steps took much longer than planned. The backup task was taking much longer than it had historically - in hindsight, it should've been clear that, with one of our problems being a huge increase in backup size, both generating and copying the backups would take far longer than they previously had. At this point, I announced an extension of the outage window, to 2-3pm UTC.

Snapshotting the disks also took far longer than hoped. I was working through Google's Cloud Console web UI, which can be infuriatingly slow: after a (not overly quick) page load, there is a good few seconds' wait whilst the data loads asynchronously and then populates the page content. Working through the disks to determine which was which, snapshotting each one, and tagging those snapshots took some time. This was compounded by my not being familiar with the disk snapshot UI, and by an abundance of caution: I checked multiple times that we did in fact have snapshots of all the relevant volumes.

After this was done, I upgraded the Kubernetes cluster to 1.10.x and attempted to resize the persistent volumes, which immediately failed. It became clear at this point that I had missed two crucial details. Firstly, it was not possible to make a static-sized disk resizeable on the fly: it would require destroying and then recreating the PersistentVolumes, then restoring the data onto those, either by restoring a backup image or by simply copying the old content to the new volumes. Secondly, it became clear that Google Kubernetes Engine did _not_ in fact provide support for resizing disks, as it was an alpha feature which was not possible to enable.

At this point I made sure the old persistent volumes would, in fact, persist after they had been orphaned. This gave us three copies of the data: the local backup, the GCE disk snapshots, and the retained GCE disks themselves. I then spent some time figuring out how to pause service availability, so we could make the new disks available to the cluster without actually starting the services with a clean slate. This took a surprising amount of time, and was somewhat fiddly.
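For the record, both of those last two steps boil down to small kubectl operations along these lines; the resource names here are placeholders rather than our actual ones:

    # make sure an orphaned PersistentVolume outlives its claim
    kubectl patch pv <repositories-pv> \
        -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

    # pause the service without throwing away its configuration
    kubectl scale deployment <gitlab-deployment> --replicas=0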
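Similarly, the manual backup and snapshot steps above reduce to something like the following sketch; the paths assume the Omnibus layout, and the pod, host, disk, and zone names are all illustrative:

    # stream the collected backup data off the pod rather than tarring it
    # in place (which would have needed twice the free space we didn't have)
    kubectl exec <gitlab-pod> -- tar -C /var/opt/gitlab/backups -cf - . \
        | ssh backup-host 'cat > /mnt/encrypted/gitlab-backup.tar'

    # snapshot the GCE disk backing each PersistentVolume
    gcloud compute disks snapshot <pd-name> \
        --snapshot-names=<pd-name>-pre-resize --zone=<zone>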
Whilst working through this, I also ran into a new failure mode in how we run Helm: if some resources were unavailable (due to a typo), it would block indefinitely waiting for them to become available (which would never happen), rather than failing immediately.

Long before this, I had started copying the backups back towards the cluster, so that we had the option of restoring from backup if that turned out to be a good idea. However, at this point I started having serious degradation of my network connection: not only did my upload speed vary wildly (by a factor of 100), but due to local issues the workstation I was using spent some time refusing to route HTTPS traffic to the Kubernetes control API. Much time was spent debugging and resolving this.

The preferred option was to restore the previous snapshots into new disks: this meant we did not have to block on the backup upload, and could be sure that we had exactly the same content as previously. I started pursuing this option: once I had ensured that the new Kubernetes persistent volumes had created new GCE disks, I attempted to restore the disk snapshots into them.

At this point, I discovered the difference between GCE disks, disk images, and disk snapshots. It is not possible to directly restore a snapshot into a live disk: you must attach both the target disk and a disk created from the source snapshot to a new GCE VM, boot the VM, and copy between the two. I did this with new GCE disks, and attempted to use those disks as backing storage for new Kubernetes PersistentVolumes that we could reuse directly. More time was lost due to the Helm failures above. Eventually, when we got there, I discovered that creating a new Kubernetes PV/PVC from a GCE disk will obliterate all the content on that disk, so that avenue was useless.

Quite some hours into the outage, I decided to take a fifteen-minute break, go for a walk outside, and try to reason about what was going on and what we should do next.

Coming back, I pursued the last good option: stop the Kubernetes services completely, attach both the new enlarged PV disks and disks restored from the old snapshots to a new ephemeral GCE VM, copy directly between them, stop the GCE VM, and restart Kubernetes. This mostly succeeded, except for subvolumes: Kubernetes exposes '$disk/mysql/' as the root mountpoint for the MySQL data volume, whereas mounting the raw disk exposes '$disk' as the root mountpoint. The copy didn't correctly preserve the subdirectory, so though we had the Git data accessible, MySQL was seeing an empty database.

To avoid any desynchronisation, I destroyed all the resources again, created completely new and empty volumes, created a new GCE VM, and re-did the copy with the correct directory structure. Coming back up, I manually verified that the list of repositories and users was the same as previously, and worked through parts of the UI (e.g. could I still search for issues after not restoring the Redis cache?) and a few typical workflows. I had also started a full backup task in the background, to ensure that our backup cron job would succeed in the morning without needing another outage.

Once this was done, around 18:45 UTC, I restored public access and announced the end of the outage.

A couple of days later, I spent some time cleaning up all the ephemeral resources created during the outage (persistent volumes, disks, disk snapshots, VMs, etc).
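For the record, the procedure that eventually worked is roughly the sketch below. Disk, VM, and device names are placeholders, and the real invocations differed in detail, but it captures the shape of it:

    # materialise a snapshot as a new disk (snapshots can't be attached directly)
    gcloud compute disks create restored-repos \
        --source-snapshot=<snapshot-name> --zone=<zone>

    # attach both the restored disk and the new, larger PV disk to a scratch VM
    gcloud compute instances create copy-vm --zone=<zone>
    gcloud compute instances attach-disk copy-vm \
        --disk=restored-repos --device-name=old --zone=<zone>
    gcloud compute instances attach-disk copy-vm \
        --disk=<new-pv-disk> --device-name=new --zone=<zone>

    # then, on the VM: mount both and copy the whole tree, so that
    # subdirectories end up exactly where Kubernetes expects to find them
    sudo mkdir -p /mnt/old /mnt/new
    sudo mount /dev/disk/by-id/google-old /mnt/old
    sudo mount /dev/disk/by-id/google-new /mnt/new
    sudo cp -a /mnt/old/. /mnt/new/

Getting that last copy right, i.e. preserving the subdirectory layout, is exactly the detail that forced the second pass.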

What went badly
-----------------------

Many things.

The first we realised something was wrong was when people mentioned the failures on IRC. Setting up a system (probably based on Prometheus/Grafana, as it is recommended by upstream, integrates well with the services, and has a good feature set) to capture key metrics like disk usage and API error rate, and to alert via email/IRC when these hit error thresholds, is a high-priority but also time-consuming task. Doing it myself requires learning a fair few new things, and also downtime (see below) whilst deploying it. So far I have not had a large enough block of time.

There is also a single point of failure: I am the only administrator who works on GitLab. Though Tollef and Keith have access to the Google organisation and could do so, they don't have the time. If I were not available, they would have to go through the process of setting up their accounts to control Kubernetes and familiarising themselves with our deployment. This is obviously bad, especially as I am relatively new to administering Kubernetes (as seen from the failures in the timeline).

The length of the backup task completely blew out our outage window. It should've been obvious that backups would take longer than they had previously; even if not, we could've run a test task to measure this before we announced an outage window which could never have been met.

My internet connection choosing that exact afternoon to be extremely unreliable was quite unhelpful. If any of this had been planned, I would've been somewhere with a much faster and more stable connection, but unfortunately I didn't have that choice.

Though I'd tested some of these changes throughout the morning, I hadn't tested the exact changes I ended up making. I'm not sure how it is possible to test some of them (e.g. how do I, with a scratch cluster, test that persistent volumes created with a several-versions-old Kubernetes will upgrade ... ?), but certainly I could've at least made the tests a little more thorough and realistic, particularly for things like the disk snapshots.


What went well
---------------------

No data loss: we had backups, and lots of them (at least three at every point from when destructive operations began). There was no danger at any point of data being lost for good, though some of those copies would have required unacceptably long downtime if used as the source for a restore.

Data being on GCS: having much of our data in cloud storage long delayed the point at which we needed to expand our local storage at all.


What we can do in future
----------------------------------

Monitoring, logging, and alerts: https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/8 is the task I filed long ago to get this set up. If anyone reading this is familiar with the landscape and can help, I would be immensely grateful for it.

Better communication: due to its last-minute nature, the outage was only announced on IRC and not to the lists; I also didn't inform the lists when the outage was dragging on, as I was so focused on what I needed to do to bring it back up. This was obviously a bad idea. We should look into how we can communicate these kinds of things a lot better.

Cloud Console web UI: the web UI is borderline unusable for interactive operation, due to long load times between pages and the amount of inter-page navigation required to achieve anything. It would be better to get more familiar with driving the command-line clients in anger, and use those where possible.
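For example, the disk-identification step that was so painful in the web UI is quick from the command line; something like this (the column path assumes the GCE PD provisioner) maps each PersistentVolume to the GCE disk backing it:

    # which GCE disk backs which PersistentVolume
    kubectl get pv -o custom-columns=NAME:.metadata.name,DISK:.spec.gcePersistentDisk.pdName

    # then inspect or snapshot them without touching the Cloud Console
    gcloud compute disks list
    gcloud compute snapshots list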
More admins: having myself as the single GitLab admin is not in any way sustainable, and we need to enlarge our group of admins. Having someone actually familiar with Kubernetes deployments in particular would be a massive advantage. I'm learning on the spot from online documentation, which, given the speed at which Kubernetes development moves, is often either useless or misleading.

Move away from the Omnibus container: currently, as mentioned above, every service behind gitlab.fd.o is deployed into a single Kubernetes pod, with PostgreSQL and Redis in linked containers. The GitLab container is called the 'omnibus' container, combining the web/API hosts, background job processors, Git repository service, SSH access, Pages server, etc. The container is a huge download, and on start it runs Chef, which takes approximately 2-3 minutes to configure the filesystem before even thinking about starting GitLab. The total minimum downtime for every change is therefore 4-5 minutes, and every change makes the whole of GitLab unavailable: this makes us really reluctant to change configuration unless necessary, giving us less experience with Kubernetes than we might otherwise have. GitLab upstream is working on a 'cloud native' deployment, splitting each service into its own container, each of which does _not_ spend minutes running Omnibus at startup and can be independently reconfigured without impacting the other services. Actually making this move will require multiple hours of downtime, which will need to be communicated long in advance.

Move storage to resizeable volumes or GCS: the next time we exhaust our disk space, we're going to need to go through all of this again. Moving more of our storage to cloud storage, where we can, means that we don't have to worry about it. When Kubernetes 1.11.x becomes available through GCP, we can also recreate the disks as resizeable volumes, which can be grown on demand, avoiding this whole process.
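Once that support lands, growing a volume should be a matter of a couple of kubectl operations rather than a migration; roughly (the storage class and claim names, and the new size, are illustrative):

    # allow expansion on the storage class (one-off)
    kubectl patch storageclass standard \
        -p '{"allowVolumeExpansion": true}'

    # then growing a volume is just a size bump on the claim; the pod may
    # still need a restart for the filesystem itself to be grown
    kubectl patch pvc <repositories-pvc> \
        -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'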
At least we've learned quite a bit from doing it this time?

Cheers,
Daniel

[0]: The Helm charts and configuration are available at https://gitlab.freedesktop.org/freedesktop/ with the exception of a repository containing various secrets (SSH keys, API keys, passwords, etc) which are grafted on to the existing chart.


We can also save space by using the main repo for private branches, e.g. my branches would be in refs/heads/mareko/*.

Marek
Marek Olšák <maraeo at gmail.com> writes:

> We can also save space by using the main repo for private branches,
> e.g. my branches would be in refs/heads/mareko/*.

It sounds like gitlab is not going to have this fixed very soon: https://gitlab.com/gitlab-org/gitlab-ce/issues/23029

I think this is an interesting idea. It would increase the exposure of Mesa developers to what each other is doing, without everyone needing to go star each other's repos. I think Mesa has been a little too cautious with branches in the main repo -- we very rarely push branches, and we're definitely bad at deleting those old branch heads once they're no longer relevant as branches for development. On the other hand, I'm not sure everyone wants to see every weird unfinished branch I have, and a personal repo is nice for hanging on to those.

Once we have gitlab CI hooked up (hopefully in the next week or so), I want developers to be able to push to gitlab and have CI go through the patch before they actually submit to the ML. If we're pushing to branches on origin, that could potentially be a lot of mailing list spam. I'd want some sort of solution for that before we start telling people to just put their stuff in user branches on the central repo instead of personal forks.

People can also be a lot more cavalier with 'git push personalrepo +branch' than I'd like them to be with 'git push origin +branch'.
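For concreteness, the day-to-day flow Marek is describing would look something like this (the branch names are only examples):

    # push a work-in-progress branch into a per-user namespace on the main repo
    git push origin my-feature:refs/heads/mareko/my-feature

    # and clean it up once it's no longer needed
    git push origin --delete mareko/my-feature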