thr3ads.net - CentOS announce - [CentOS-announce] Notice of Service Outage and followup LON1/UK Facility [Mar 2016]

Karanbir Singh

2016-Mar-30 10:25 UTC

[CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

== What happened =
On Wednesday February 24th, at 6pm UTC, the DC hosting some of
the CentOS equipment used for various roles suffered multiple
power outages. The facility was completely dark
for just under 2 hrs, and we were able to start recovering services by
8pm UTC. By midnight we had most services restored, and by 2:00AM UTC
on Feb 25th we had all services restored.

That meant that the machines in those racks were running on batteries
(UPS in the racks) but finally went down in an uncontrolled way due to
lack of communication with that UPS.

Subsequently, on Monday March 14th, we suffered another power outage in
the racks, this time due to an overload on the rack power circuits.

== Services that were impacted =
 - severity critical : the mirrorlist.centos.org node for IPv6 went down
(while multiple mirrorlist.centos.org nodes for IPv4 were still
online). That means that machines with only IPv6 connectivity couldn't
get yum to retrieve the list of nearest mirrors.
 - severity medium : Our main buildservices queue management services
were down; note: this did not impact our ability to build, test and
deliver updates.
 - severity medium : www.centos.org and www.centos.org/forums weren't
reachable through IPv6 : at the moment, those services are natively
reachable through IPv4, but proxied through nodes in that DC for IPv6
users. Most tested browsers were falling back to IPv4 during that period
 - severity medium : CentOS DevCloud
(https://wiki.centos.org/DevCloud) : that means that CentOS Developers
weren't able to instantiate new CentOS test VMs for their work, but
also weren't able to reach the existing ones.
 - severity low : several publicly facing small services like
http://planet.centos.org , http://seven.centos.org (not critical and
could be restored quickly to other VMs elsewhere)
 - severity low : the server leading the armv7hl builds for the Plague
build farm was also offline, meaning no armhfp builds during that
timeframe (but no updates were due to be built, so the impact was mitigated)
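
The IPv6-only impact described in the list above can be checked from a
client machine by attempting a TCP connection separately over each
address family. The following is a minimal sketch in Python, not part of
the CentOS tooling: the helper names `try_connect` and `reachability`,
the default port 80 and the 5 second timeout are all assumptions made
for illustration.

```python
import socket

# Address families to probe, keyed by a human-readable label.
FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def try_connect(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over `family`.

    Hypothetical helper: resolves the host for one address family only,
    then tries each returned address in turn.
    """
    try:
        infos = socket.getaddrinfo(host, port, family=family,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the name has no address of this family
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False


def reachability(host, port=80, probe=try_connect):
    """Map each address family label to whether `probe` could connect."""
    return {name: probe(host, port, family)
            for name, family in FAMILIES.items()}
```

Calling `reachability("mirrorlist.centos.org")` on a dual-stack machine
returns a dictionary with one boolean per family. A result where only the
IPv6 entry is False would match the failure mode described above, where
IPv4 clients were unaffected and IPv6-only clients could not fetch the
mirror list.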

== Followup actions and notes =
   Over the years, the baseline recovery model we've used and tried to
enforce is one of 'restore in place', take a downtime hit if needed -
and ensure we have service continuity for the user facing components (
the mirrorlist service, the centos update and content distribution
services). For other resources, like the main website etc, we ensure
there are good backups available in multiple places, usable to restore
services should there be a need. This model has worked well for us
over the years, and we've had very few, if any, service outages
that had a user impact. The restore in place/restore outside HA also
meant we were able to better utilise the exclusively sponsored
machines we rely on.

   However, as the project grows, with a lot more infrastructure being
consolidated into a few locations for non CDN services, our exposure
to service downtime has dramatically increased. It's clear that we need
to expand the scope of where we back up to, how we back up, how we
anticipate failure and our ability to restore services in a timely
manner should there be facilities outages. In the coming weeks, we are
going to undertake a deep dive into our Infrastructure design and
delivery and try to first come up with a consolidated set of risks we
need to manage against, and then work towards reducing the risk,
spreading the availability as needed.

   Our backend storage platform for the DevCloud and persistent
storage for other nodes in the facility is run from a distributed,
replicated Gluster setup. In spite of the sudden loss of power, in a
production environment with hundreds of running VMs and dozens of
running data jobs, we were able to trivially recover our entire data
set with minimal data loss. Some of the running VMs inside the
DevCloud did see local filesystem issues, but we don't think that was a
backing storage issue. This event has dramatically increased our
confidence in the Gluster technology stack and we will certainly be
looking at extending deployments for it internally.

== Comments about hosting facility =
   Their status post about this:
http://status.uk2.net/2016/02/24/london-power-outage/

   We have multiple racks at this facility, and have a long standing
relationship with them going back to late Summer 2012. Over this
period we have had a near perfect uptime record for our equipment
there. And above all we have been consistently impressed with the
speed and knowledge of the support we've received at the DC. In
many cases, how the facility reacts to an outage defines the real service
value - and in this case, we can only commend the fantastic support we
had through the outage hours. We do however feel there could be better
monitoring and reporting of some of the facilities information and
will be working with them to improve in those regards.

Fabian Arrotin and Karanbir Singh
The CentOS Project
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQEcBAEBAgAGBQJW+6mPAAoJEI3Oi2Mx7xbtHo8IAI+RVIDjGwJOzgJ5Ry7mHwLe
Zc+aBUQklDk5oRaDk7QZHsaGp1lclNsutBk3YujNlXwMC4hUKdPwkTVuX50usQ7s
kd7qF1BlElNyfMPfFJGwchIQBDOZqZxkZP4uOrvQUnIZUYfyx6NnPnGS0uatBdnw
hBJ6TbgP6i50h7U0fNWjHU2I8xe0zsx1jVrvNngDMlQcIHC0d1KMtpOgSMR5f9Bn
bLwghfD4/yPyqJP1sc+021ANk1+a7uXs4KKG3MXpMlFyvYmv2ict0Q/sDtz0jzCx
kbRgDGm/GF1TUUENciESkHPKy3kLWA1oCicOkiEhzNz2YwFQNdNpi9PqWEK/F5Q=bDIN
-----END PGP SIGNATURE-----

Always Learning

2016-Apr-02 16:07 UTC

[CentOS] [CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

On Wed, 2016-03-30 at 11:25 +0100, Karanbir Singh wrote:
> On Wednesday February 24th, at 6pm UTC, the DC hosting some of
> the CentOS equipment used for various roles suffered multiple
> power outages. The facility was completely dark
> for just under 2 hrs, and we were able to start recovering services by
> 8pm UTC. By midnight we had most services restored, and by 2:00AM UTC
> on Feb 25th we had all services restored.
No emergency diesel generators ?  

I live in an area, having an unpublicised nationally strategic telecoms
hub, with 9+ discrete data centres within 1.5 km of a power station
(linked by dedicated underground cable to another power station) yet
each data centre has at least one diesel generator.

Resilience seems to be their motto.

-- 
Regards,

Paul.
England, EU.      England's place is in the European Union.

Valeri Galtsev

2016-Apr-02 16:19 UTC

[CentOS] [CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

On Sat, April 2, 2016 11:07 am, Always Learning wrote:
>
> On Wed, 2016-03-30 at 11:25 +0100, Karanbir Singh wrote:
>
>> On Wednesday February 24th, at 6pm UTC, the DC hosting some of
>> the CentOS equipment used for various roles suffered multiple
>> power outages. The facility was completely dark
>> for just under 2 hrs, and we were able to start recovering services by
>> 8pm UTC. By midnight we had most services restored, and by 2:00AM UTC
>> on Feb 25th we had all services restored.
>
> No emergency diesel generators ?
Somebody has to donate that...

Valeri
>
> I live in an area, having an unpublicised nationally strategic telecoms
> hub, with 9+ discrete data centres within 1.5 km of a power station
> (linked by dedicated underground cable to another power station) yet
> each data centre has at least one diesel generator.
>
> Resilience seems to be their motto.
>
>
>
> --
> Regards,
>
> Paul.
> England, EU.      England's place is in the European Union.
>
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++

Karanbir Singh

2016-Apr-05 23:14 UTC

[CentOS] [CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

On 02/04/16 17:07, Always Learning wrote:
>
> On Wed, 2016-03-30 at 11:25 +0100, Karanbir Singh wrote:
> 
>> On Wednesday February 24th, at 6pm UTC, the DC hosting some of
>> the CentOS equipment used for various roles suffered multiple
>> power outages. The facility was completely dark
>> for just under 2 hrs, and we were able to start recovering services by
>> 8pm UTC. By midnight we had most services restored, and by 2:00AM UTC
>> on Feb 25th we had all services restored.
> 
> No emergency diesel generators ?  
I believe the second fail-back operation is what caused most of the
issues; this was the fail-back from the backup source to the mains once
they were live again.

regards


-- 
Karanbir Singh
+44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh
GnuPG Key : http://www.karan.org/publickey.asc
