On July 20, between 11:03 and 11:49 UTC, device payloads were not received and alerting was unavailable. This was caused by a power failure at our primary data centre with SoftLayer in Washington, USA. The power failure was caused by a cascade of problems:
The power failure took half of our servers offline at that location. We deploy across two independent rooms, which meant we still had sufficient capacity to run the service. This is usually achieved by switching our routing to the alternative load balancer without needing to change the public IP address. SoftLayer has a routing capability that lets us switch traffic at the IP level to different hardware targets in less than a minute, allowing us to respond automatically to failures such as a front-end load balancer going down.
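For illustration, here is a minimal sketch of that kind of automated failover: a watcher probes the active load balancer and, after a few consecutive failed checks, repoints the public IP at the standby. The hostnames, IPs, thresholds, and the reroute_ip() placeholder (standing in for the provider's routing API) are all assumptions for the example, not our production setup.

```python
# Minimal sketch of the automated IP-level failover described above.
# reroute_ip() is a placeholder standing in for the provider's routing
# API; hostnames, IPs, and thresholds are illustrative, not our
# production values.
import time
import requests

PUBLIC_IP = "203.0.113.10"       # example public service IP
ACTIVE_LB = "lb-a.internal"      # example load balancer targets
STANDBY_LB = "lb-b.internal"
FAILURE_THRESHOLD = 3            # consecutive failed probes before failover


def lb_healthy(host: str) -> bool:
    """Probe the load balancer's health endpoint with a short timeout."""
    try:
        return requests.get(f"http://{host}/healthz", timeout=2).status_code == 200
    except requests.RequestException:
        return False


def reroute_ip(ip: str, target: str) -> None:
    """Placeholder: the real call repoints the public IP at a different
    hardware target via the provider's routing API (under a minute)."""
    print(f"rerouting {ip} -> {target}")


def watch() -> None:
    failures = 0
    while True:
        failures = 0 if lb_healthy(ACTIVE_LB) else failures + 1
        if failures >= FAILURE_THRESHOLD:
            reroute_ip(PUBLIC_IP, STANDBY_LB)
            return
        time.sleep(10)
```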
However, the power failure also took down the SoftLayer routers that implement that failover mechanism. We have simulated this kind of failure in the past and it is a known failure condition / limitation of the product. We therefore fell back to our backup plan: switching our DNS to the replacement systems. We keep a deliberately low TTL on these DNS entries for exactly this scenario.
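As a sanity check, something like the following sketch (assuming the dnspython library) could audit those TTLs periodically; the record names and the 60-second ceiling are illustrative assumptions rather than our actual entries.

```python
# Sketch of a TTL audit for the failover DNS entries, assuming the
# dnspython library. The record names and the 60-second ceiling are
# illustrative assumptions, not our actual configuration.
import dns.resolver

FAILOVER_RECORDS = ["agent.example.com", "ui.example.com"]  # example names
MAX_TTL = 60  # seconds; low enough that a DNS switch takes effect quickly

for name in FAILOVER_RECORDS:
    answer = dns.resolver.resolve(name, "A")
    ttl = answer.rrset.ttl
    print(f"{name}: TTL={ttl}s [{'OK' if ttl <= MAX_TTL else 'TOO HIGH'}]")
```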
Device payload processing was restored soon after the DNS switch. However, due to the reduced capacity, Server Density UI response times were higher than normal and we ran in a deliberately degraded state until 22:02 UTC when, after power was restored, we finished bringing the systems in the failed room back online and in sync. Rebooting is quick, but transferring the data from the downtime window took some time because it had to be copied from our secondary facility in San Jose, California.
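Returning a recovered node to service is typically gated on it catching up first. Below is a hedged sketch of such a gate, where replication_lag_seconds() is a stand-in for whatever lag metric the datastore exposes and the threshold is an assumption.

```python
# Sketch of gating a recovered node's return to service on replication
# catch-up. replication_lag_seconds() is a stand-in for whatever lag
# metric the datastore exposes; the 5-second threshold is an assumption.
import time

MAX_LAG_SECONDS = 5.0


def replication_lag_seconds(node: str) -> float:
    """Placeholder: query the datastore for how far this node lags
    behind the facility it is copying from."""
    raise NotImplementedError


def wait_until_synced(node: str, poll_interval: float = 30.0) -> None:
    while replication_lag_seconds(node) > MAX_LAG_SECONDS:
        time.sleep(poll_interval)
    print(f"{node} is within {MAX_LAG_SECONDS}s of sync; safe to re-enable")
```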
We expect this type of failure (where we have to fail over via DNS) to cause up to 20 minutes of downtime, covering the time needed to diagnose and troubleshoot the problem and then perform the DNS switch. In this case it took 46 minutes to restore service. The extended outage was caused by human factors: unfamiliarity with the DNS failover procedures.
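One way to take those human factors out of the procedure is to collapse the whole DNS switch into a single scripted, rehearsable step. The sketch below assumes a hypothetical update_a_record() wrapper around the DNS provider's API, with example record names and standby IPs.

```python
# Sketch of collapsing the DNS failover into one scripted, rehearsable
# step. update_a_record() is a placeholder for the DNS provider's API;
# the record names and standby IPs are illustrative assumptions.
FAILOVER_PLAN = {
    "agent.example.com": "198.51.100.20",  # example standby-facility IPs
    "ui.example.com": "198.51.100.21",
}


def update_a_record(name: str, ip: str) -> None:
    """Placeholder: call the DNS provider's record-update API here."""
    print(f"switched {name} -> {ip}")


def fail_over() -> None:
    for name, ip in FAILOVER_PLAN.items():
        update_a_record(name, ip)


if __name__ == "__main__":
    fail_over()
```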
After reviewing the outage response, we have introduced the following improvements:
This writeup was slightly delayed as we waited on the full root cause analysis from SoftLayer, which we now have. Based on the preliminary root cause analysis, SoftLayer took a number of subsequent actions, including:
To prevent this failure from happening again, SoftLayer have provided this remediation statement:
As part of its standard operating procedure, the facility provider performs monthly load tests that simulate a utility power failure and validate the protection capabilities of the electrical infrastructure. The monthly load tests qualify the proper execution of tasks for the UPS systems, generators, and PLC switchgear systems. During these routine procedures, the electrical infrastructure operated as designed and showed no indications of failure.
The facility provider also performs de-energized maintenance on all SoftLayer data centers every five years. The purpose of this maintenance is to perform a full inspection of the electrical infrastructure, which includes verifying that connections are secure and that cables are not at risk of failing.
The following preventative actions will be taken by the facility provider, on top of evaluating existing maintenance procedures for improvements: