Service unavailability
Incident Report for Server Density
Postmortem

Server Density is deployed across two geographically independent data centers. Our primary facility is in Washington DC, USA, and our secondary facility in San Jose, USA replicates traffic in real time. We can fail over between the two data centers with minimal or no customer impact, depending on the type of outage.

On 26 Aug at 15:45 UTC, our internal monitoring triggered an alert to notify us that the nodes in our secondary data center were no longer processing the usual level of traffic. Shortly afterwards, further alerts warned us that the database nodes in the secondary data center were starting to lag behind the primary data center. After some initial investigation, we discovered high latency and packet loss between our two data centers and opened a ticket with our provider, Softlayer, at 15:59 UTC.
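
For illustration only - this is a sketch rather than a description of our actual monitoring stack - a replication lag check of this kind can be written in a few lines, assuming a MongoDB replica set spanning both facilities and the pymongo driver. The hostnames and threshold below are hypothetical:

    # Hypothetical replication lag check: assumes a MongoDB replica set
    # spanning both data centers and the pymongo driver. Hostnames and the
    # threshold are illustrative, not production values.
    from pymongo import MongoClient

    LAG_THRESHOLD_SECS = 60  # alert once a secondary falls this far behind

    client = MongoClient("mongodb://wdc-db1.example.com,sjc-db1.example.com/?replicaSet=rs0")
    status = client.admin.command("replSetGetStatus")

    # The primary's latest optime is the reference point for measuring lag.
    primary_optime = max(m["optimeDate"] for m in status["members"]
                         if m["stateStr"] == "PRIMARY")

    for member in status["members"]:
        if member["stateStr"] == "SECONDARY":
            lag = (primary_optime - member["optimeDate"]).total_seconds()
            if lag > LAG_THRESHOLD_SECS:
                print("ALERT: %s is %ds behind the primary" % (member["name"], lag))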

At 16:48 UTC Softlayer issued an emergency alert that there had been a fibre cut between the datacenter aggregation routers and the backbone routers in Washington. This was caused by directional drilling being done by a construction crew about 3 miles from the data center. This meant there was reduced capacity and redundancy at that data center. Although this was not affecting customer service to Server Density, it was affecting our internal redundancy as we were seeing reduced capacity for replicating traffic to our secondary data center.

At 17:26 UTC, the remaining network link at the Washington data center failed under the increased load from the loss of the other link. This caused a complete service outage for 6 minutes before the network was restored.

Once the networking was restored, Server Density v2 (customers using account URLs ending in .io) came back online immediately - the total outage time for SDv2 was just under 10 minutes.

However, Server Density v1 (customers using account URLs ending in .com, including SDv2 customers who still use their SDv1 URLs because they have not yet shut down their SDv1 account) was still unavailable. It took a further 1h30m to investigate and locate the cause of the outage affecting just SDv1. Part of the queuing system which processes incoming data and handles alerting was stuck in an inconsistent state. This caused high latency in the processing handler, which in turn caused tasks to back up and exhaust the available resources. Once this was tracked down, the queue was reset and SDv1 was restored.
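
As an illustrative sketch only - the report above does not spell out our queuing stack - the kind of backlog check and reset involved might look like the following with a RabbitMQ-style broker and the pika client. The broker host, queue name and threshold are hypothetical:

    # Hypothetical sketch: detect a backed-up work queue and reset (purge) it.
    # Assumes a RabbitMQ-style broker and the pika client; the host, queue
    # name and threshold are illustrative only.
    import pika

    BACKLOG_THRESHOLD = 10000  # messages waiting before the queue is treated as stuck

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="queue.example.com"))
    channel = connection.channel()

    # A passive declare reports the current depth without modifying the queue.
    depth = channel.queue_declare(queue="alert_processing", passive=True).method.message_count

    if depth > BACKLOG_THRESHOLD:
        print("alert_processing has %d messages queued; purging" % depth)
        channel.queue_purge(queue="alert_processing")

    connection.close()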

Full service to all Server Density components was restored at 19:09 UTC.

On 27 Aug at 04:31 UTC, around 3000ft (just under 1km) of new fibre optic cable was delivered on site and the repair process was started. This was completed at 08:04 UTC and full redundancy was restored to the Washington data center.

Unfortunately this incident combined the loss of redundancy with a second failure, which resulted in around 10 minutes of downtime for SDv2 customers and around 1h30m of downtime for SDv1 customers. Due to the loss of redundancy, we were unable to fail over to the secondary data center when the primary data center network failed a few hours later.

As a result of this incident, we will be taking the following actions:

  • Reviewing the cause of the second network failure
  • Reviewing our data center deployments to see if we can design redundancy around this type of failure

In the meantime, we recommend that all customers still using SDv1 migrate to SDv2, which has more advanced failover capabilities. SDv1 is no longer actively developed but is still supported until we announce the sunset date later this year. Migration is mostly automated and only takes a few minutes to complete - there is a checklist at https://support.serverdensity.com/hc/en-us/articles/201433816-v1-migration-checklist and we're happy to assist where necessary.

We're sorry for the problems here. If you have any questions, please get in touch: hello@serverdensity.com

Posted Aug 27, 2014 - 15:02 BST

Resolved
The fibre-optic cable provider has informed us that they have finished the splicing and network connectivity has been restored. We will now close this incident. A full analysis will be posted after we complete a full investigation in the next few days.
Posted Aug 27, 2014 - 10:00 BST
Update
Softlayer has informed us that there were delays in pulling the new fibre and that the splicing procedure only began at 04:30 UTC.
Posted Aug 27, 2014 - 08:34 BST
Update
We have received an update via Softlayer that fibre crews will need to replace 2,500-3,000 ft of cable. The new fibre reel is en route and the crews expect to have the new cable spliced by 02:00 UTC on 2014-08-27.
Posted Aug 26, 2014 - 21:14 BST
Monitoring
Service to SDv1 has been restored. All services are now running normally but we are in a state of reduced redundancy due to the fibre cut at Softlayer. We will update this if any further issues arise and when we receive updates from Softlayer.
Posted Aug 26, 2014 - 20:09 BST
Update
Softlayer are continuing to work on the networking issue affecting connectivity in Washington, where a fibre cut is causing reduced redundancy. All services are operational except for customers still on our SDv1 platform (users of account URLs ending in .com). We are still investigating the cause of that outage.
Posted Aug 26, 2014 - 19:58 BST
Update
The networking issue between our primary and secondary data centers is still ongoing. SDv2 (users of account URLs ending in .io) is fully operational. SDv1 (users of account URLs ending in .com) is currently unavailable - we are investigating the cause.
Posted Aug 26, 2014 - 19:31 BST
Update
Service has now been restored, but the networking issue affecting internal traffic between our two data centers is still ongoing, which may have further impact on our redundancy. Updates will follow as we have them from Softlayer.
Posted Aug 26, 2014 - 18:57 BST
Update
As of 26-Aug-2014 15:47 UTC, multiple redundant links between the aggregation routers and the backbone routers in our primary data center at Softlayer Washington DC, USA failed due to a fibre cut. This caused increased latency and packet loss for internal traffic between the primary data center and our secondary data center in San Jose, USA, affecting replication between the facilities but not customer service to Server Density.

Softlayer informed us this was being investigated.

At around 17:30 UTC, the situation escalated and our primary facility network became unavailable, affecting customer services. Due to the ongoing networking issue and replication lag, we have been unable to fail over to the secondary data center.

All services are affected by this, but we are seeing systems recover and so are now starting our post-outage checklists to verify all systems. Further updates will follow shortly.
Posted Aug 26, 2014 - 18:46 BST
Identified
Our provider SoftLayer is experiencing complete unavailability of their Washington data center, which is our current primary data center. This is causing a complete service outage for both SDv1 and SDv2.
We'll be updating this as we receive more information from SoftLayer.
Posted Aug 26, 2014 - 18:36 BST