Service unavailability
Incident Report for Server Density
Postmortem

On July 20, between 11:03 and 11:49 UTC, device payloads were not received and alerting was unavailable. This was caused by a power failure at our primary data centre with SoftLayer in Washington, USA. The power failure was caused by a cascade of problems:

  1. Utility Failure: Ancillary low voltage and ground wiring within the exterior medium voltage utility switchgear arced, causing the main utility feed breaker to trip.
  2. Primary Generator Failure: A disconnected control power cable and a loose sensing wire for the programmable logic controller (PLC) prevented the primary generator breaker from closing.
  3. Reserve Generator Failure:
    1. A normally open contactor prevented the start signal from reaching the reserve generator. The contactor needs control power from the PLC to remain closed; since control power was unavailable, the contactor stayed open and the reserve generator never received its start signal.
    2. A point-of-use UPS unit that provides backup power to the reserve generator PLC failed.

The power failure took half of our servers offline at that location. We deploy across two independent rooms, which meant we still had sufficient capacity to run the service. Failover is usually achieved by switching our routing to the alternative load balancer without needing to change the public IP address: SoftLayer has a routing capability that lets us switch traffic at the IP level to different hardware targets in less than a minute, allowing us to respond automatically to failures such as a front end load balancer going down.

However, the power failure also took down the SoftLayer routers that implement that failover mechanism. We have simulated this kind of failure in the past and it is a known failure condition and limitation of the product, so we applied the backup plan instead: switching our DNS to the replacement systems. We keep a deliberately low TTL on these DNS entries for exactly this situation.
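To illustrate how the low TTL helps, the failover amounts to repointing an A record, and a TTL of around a minute means resolvers pick up the new address shortly after the change. The snippet below is only a sketch in zone-file notation; the hostname and TTL value are placeholders, while the two IPs are the ones referenced in the updates further down this page.

    ; sketch only: the hostname and 60-second TTL are illustrative placeholders
    example.serverdensity.io.  60  IN  A  108.168.255.193   ; primary load balancer
    ; during failover, the same record is repointed at the backup IP outside the facility:
    ; example.serverdensity.io.  60  IN  A  108.168.255.44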

Device payload processing was restored soon after the DNS switch. However, due to the reduced capacity, Server Density UI response times were higher than normal and we ran in a deliberately degraded state until 22:02 UTC when, after power was restored, we finished bringing the systems in the failed room back online and back in sync. Rebooting is quick, but transferring the data from the downtime window took some time because it had to be copied from our secondary facility in San Jose, California.

We expect this type of failure (where we have to use DNS) to cause up to 20 minutes of downtime due to the time needed to diagnose, troubleshoot and then perform the DNS switch. In this case it took 46 minutes to restore service. The extended outage was caused by human factors, principally unfamiliarity with the DNS failover procedure.

After reviewing the outage response, we have introduced the following improvements:

  • We have provisioned new hardware in Washington to store an additional copy of all customer data. This will allow us to re-sync from within the local network rather than be slowed down by cross-country latency. The work is due to be completed within the week as we sync each shard in turn, which takes some time given the volume of data.
  • We are in the process of moving all our endpoints behind Cloudflare. This was a separate project already planned, but it also gives us the bonus of effectively instant DNS changes: because all requests route through Cloudflare, changes to our internal IPs are reflected immediately throughout the Cloudflare network with no change to our public facing IPs (see the sketch after this list).
  • We have introduced war games for the team, targeting improved response times for complex and rare failures like this one. The first of these was conducted last week and they are now scheduled every 3 months.
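As a rough sketch of what the Cloudflare change enables: once the public hostname resolves to Cloudflare, repointing an origin is a single API call to update the DNS record, and the new origin takes effect without waiting on client-side TTL expiry. The example below uses Cloudflare's v4 API; the zone ID, record ID, hostname and credentials are placeholders rather than our real values.

    # Sketch only: repoint an origin A record via the Cloudflare v4 API.
    # Zone ID, record ID, hostname and credentials are placeholders.
    import requests

    CF_API = "https://api.cloudflare.com/client/v4"
    ZONE_ID = "your-zone-id"        # placeholder
    RECORD_ID = "your-record-id"    # placeholder
    HEADERS = {
        "X-Auth-Email": "ops@example.com",  # placeholder credentials
        "X-Auth-Key": "your-api-key",
        "Content-Type": "application/json",
    }

    def repoint_origin(new_ip):
        """Point the proxied record at a new origin IP (e.g. the backup load balancer)."""
        resp = requests.put(
            "%s/zones/%s/dns_records/%s" % (CF_API, ZONE_ID, RECORD_ID),
            headers=HEADERS,
            json={
                "type": "A",
                "name": "example.serverdensity.io",  # placeholder hostname
                "content": new_ip,
                "ttl": 1,         # 1 = automatic
                "proxied": True,  # public-facing IPs stay Cloudflare's
            },
        )
        resp.raise_for_status()
        return resp.json()

    # e.g. repoint_origin("108.168.255.44") to send traffic to the backup facility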

This writeup was slightly delayed as we waited on the full root cause analysis from SoftLayer, which we now have. A number of subsequent actions were taken by SoftLayer's facility provider based on the preliminary root cause analysis, including:

  • Replacing and re-securing the damaged utility grounding wire to prevent contact with energized utility equipment;
  • Re-connection of the control power and sensing wires at the primary generator to ensure proper load transfer; and
  • Re-wiring the start signal cable to bypass a contact point on the reserve generator system.

To prevent this failure from happening again, SoftLayer have provided the following remediation statement:

As part of the standard operating procedure, the facility provider performs monthly load tests that simulate a utility power failure and validate the protection capabilities of the electrical infrastructure. The monthly load tests qualify the proper execution of tasks for the UPS systems, generators, and PLC switchgear systems. During these routine procedures, the electrical infrastructure operated as designed and showed no indications of failure.

The facility provider also performs a 5-year de-energized maintenance on all SoftLayer data centers. The purpose of this maintenance is to perform a full inspection of the electrical infrastructure, which includes verifying that connections are secure and that cables are not at risk of failing.

The following preventative actions will be taken by the facility provider, in addition to evaluating existing maintenance procedures for improvements:

  • Utility Failure: All wiring in utility cubicles at the WDC01 campus will be inspected.
  • Primary Generator Failure: The facility provider and the equipment vendor will evaluate a maintenance strategy to revalidate the configuration and installation of electrical equipment ahead of the 5-year de-energized maintenance that would normally identify these types of issues. SoftLayer will coordinate with the facility provider to make sure all SoftLayer data centers are evaluated.
  • Reserve Generator Failure:
    • After review by the design engineers and the control system vendor, the normally open contactor was found to be unnecessary. This contactor is part of an older-style relay assembly that is no longer used. All current systems with this type of assembly will be evaluated and corrected across all SoftLayer data centers. Increased battery replacement frequencies for point-of-use UPS units will also be implemented.
    • Commissioning Documentation Review: The initial commissioning documentation is under further review to identify other actions that may be required to prevent similar incidents.
Posted Aug 12, 2015 - 18:47 BST

Resolved
Normal service has now been restored. We'll be reviewing the incident over the next few days and will release a post-mortem within the next week.
Posted Jul 20, 2015 - 23:02 BST
Update
80% of our metrics cluster has now resynced to the Washington data centre. This means about 20% of customer devices/availability checks will continue to be slow at retrieving historical metrics data. We estimate that this will be complete within the next 2 hours. This final process takes some time as we have a large quantity of metrics data to validate for consistency across the link between San Jose and Washington DC.
Posted Jul 20, 2015 - 20:15 BST
Update
Our DNS has now been switched back to the original IP at 108.168.255.193 and the "no data received" alerting component re-enabled. You may still experience slow web UI and APIs as we restore our metrics database cluster in the primary data centre.
Posted Jul 20, 2015 - 15:35 BST
Update
Server Density continues to operate in a degraded status. Power has now been restored in Washington and we are progressing through our recovery checklists. Once this has been completed and we have fully restored redundancy, we will switch the system back to full capacity. We expect this to be completed shortly.
Posted Jul 20, 2015 - 15:20 BST
Update
As of 11:49 UTC, Server Density is currently operational but in a degraded state. This means that whilst all systems are functioning, there may be reduced performance when browsing our web UI and using APIs. This does not affect alert processing, but "no data received" alerts are suspended to avoid false positives and the availability monitoring check frequency has been reduced from the usual 5 minutes to every 10 minutes.

Following a power failure at our primary data centre in Washington, USA, we have failed over to our secondary data centre in San Jose, USA. The power failure also caused a network outage which meant that we were unable to fail over quickly by repointing our IP to the backup data centre. This is a known failure condition and so instead we used our alternative plan by switching our DNS to a backup IP (108.168.255.44) outside of that facility.

Power is still out at the Washington data centre and we are now working on recovery steps. Additional capacity has been launched in our SJC data centre to help resolve the performance issues.

We'll continue to update this as we progress through this incident.
Posted Jul 20, 2015 - 13:28 BST
Monitoring
Our ops team have manually failed over to another load balancer and services are up at this time. Servers were unable to post back between 11:03 and 11:49 UTC. We're continuing to monitor and will update this status post with an all clear and a detailed RFO once available.

If you hard-coded your Server Density URL into your hosts file, you'll need to update this to 108.168.255.44. Otherwise, servers will post back once the DNS changes propagate globally (5-10 minutes).
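For example, the hosts entry would be a single line like the one below; the hostname is a placeholder for your own account's Server Density URL:

    # /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
    108.168.255.44  example.serverdensity.io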
Posted Jul 20, 2015 - 12:59 BST
Identified
SoftLayer's WDC data centre has suffered a power failure in the room where the majority of our servers reside. This has also resulted in us being unable to fail over to our secondary data centre. On-site techs are working to restore power ASAP.
Posted Jul 20, 2015 - 12:39 BST
Investigating
We're looking into problems with our load balancers which are rendering accounts and payload processing unavailable. During this time, monitored servers will be unable to post back to Server Density, which may result in "no data received" alerts.
Posted Jul 20, 2015 - 12:14 BST