Server Density is deployed across two geographically independent data centers. Our primary facility is in Washington DC, USA, and our secondary facility is in San Jose, USA, which replicates traffic in real time. We can fail over between the two data centers with minimal or no customer impact, depending on the outage type.
On 26 Aug at 15:45 UTC, our internal monitoring alerted us that the nodes in our secondary data center were no longer processing the usual level of traffic. Shortly afterwards, further alerts warned us that the database nodes in the secondary data center were starting to lag behind the primary data center. After some initial investigation, we discovered high latency and packet loss between our two data centers and opened a ticket with our provider, Softlayer, at 15:59 UTC.
At 16:48 UTC, Softlayer issued an emergency alert that there had been a fibre cut between the data center aggregation routers and the backbone routers in Washington. This was caused by directional drilling being done by a construction crew about 3 miles from the data center, and it left that data center with reduced capacity and redundancy. Although this was not affecting customer service to Server Density, it was affecting our internal redundancy, as we were seeing reduced capacity for replicating traffic to our secondary data center.
At 17:26 UTC, the remaining network link at the Washington data center failed under the increased load from the loss of the other link. This caused a complete service outage for 6 minutes before the network was restored.
Once the networking was restored, Server Density v2 (customers using account URLs ending in .io) came back online immediately - the total outage time for SDv2 was just under 10 minutes.
However, Server Density v1 was still unavailable. This affected customers using account URLs ending in .com, including customers who use SDv2 but still use their SDv1 URLs because they have not yet shut down their SDv1 account. It took a further 1h30m to investigate and locate the cause of the outage affecting just SDv1. Part of the queuing system which processes incoming data and deals with alerting was stuck in an inconsistent state. This was causing high latency in the processing handler, which in turn caused tasks to back up and exhaust the available resources. Once this was tracked down, the queue was reset and SDv1 was restored.
Full service to all Server Density components was restored at 19:09 UTC.
On 27 Aug at 04:31 UTC, around 3000ft (just under 1km) of new fibre optic cable was delivered on site and the repair process was started. This was completed at 08:04 UTC and full redundancy was restored to the Washington data center.
Unfortunately, this incident combined a loss of redundancy with a second failure. Because replication capacity to the secondary data center was already reduced, we were unable to fail over when the primary data center network failed less than two hours after the first alert. This resulted in around 10 minutes of downtime for SDv2 customers and around 1h40m of downtime for SDv1 customers (from the network failure at 17:26 UTC to full restoration at 19:09 UTC).
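For anyone who wants to check the durations against the timeline, here is a minimal sketch of the arithmetic in Python. It uses only the timestamps and the 6 minute network outage quoted in this post; the helper and variable names are our own illustration and do not come from any Server Density tooling.

```python
from datetime import timedelta


def utc(hours: int, minutes: int) -> timedelta:
    """Time of day on 26 Aug (UTC), expressed as an offset from midnight."""
    return timedelta(hours=hours, minutes=minutes)


def minutes(delta: timedelta) -> int:
    """Whole minutes in a duration."""
    return int(delta.total_seconds() // 60)


# Timestamps quoted in this post (all on 26 Aug, UTC).
first_alert = utc(15, 45)
primary_link_failed = utc(17, 26)
full_service_restored = utc(19, 9)

# Duration quoted in this post for the complete network outage.
network_outage = timedelta(minutes=6)

# Derived values.
network_restored = primary_link_failed + network_outage
degraded_before_outage = primary_link_failed - first_alert
sdv1_total_downtime = full_service_restored - primary_link_failed
sdv1_after_network_restored = full_service_restored - network_restored

print("Degraded redundancy before the outage:", minutes(degraded_before_outage), "min")
print("SDv1 total downtime:", minutes(sdv1_total_downtime), "min")
print("SDv1 downtime after the network returned:", minutes(sdv1_after_network_restored), "min")
```

The last figure lines up with the roughly 1h30m spent investigating the SDv1 queue once the network was back, while the total SDv1 downtime measured from the network failure is slightly longer.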
As a result of this incident, we will be taking the following actions:
In the meantime, we recommend that all customers still using SDv1 migrate to SDv2, which has more advanced failover capabilities. SDv1 is no longer actively developed but will remain supported until the sunset date, which we will announce later this year. Migration is mostly automated and only takes a few minutes to complete - there is a checklist at https://support.serverdensity.com/hc/en-us/articles/201433816-v1-migration-checklist and we're happy to assist where necessary.
We're sorry for the problems here. If you have any questions, please get in touch: firstname.lastname@example.org