Normal service has now been restored. We'll be reviewing the incident over the next few days and will release a post-mortem within the next week.
Jul 20, 23:02 BST
80% of our metrics cluster has now resynced to the Washington data centre, which means about 20% of customer devices/availability checks will continue to be slow at retrieving historical metrics data. We estimate this will be complete within the next 2 hours. This final step takes time because we have a large quantity of metrics data to validate for consistency over the link between San Jose and Washington DC.
Jul 20, 20:15 BST
Our DNS has now been switched back to the original IP at 126.96.36.199 and the "no data received" alerting component has been re-enabled. You may still experience a slow web UI and APIs while we restore our metrics database cluster in the primary data centre.
Jul 20, 15:35 BST
Server Density continues to operate in a degraded state. Power has now been restored in Washington and we are progressing through our recovery checklists. Once these are complete and redundancy is fully restored, we will switch the system back to full capacity. We expect this to be completed shortly.
Jul 20, 15:20 BST
As of 11:49 UTC, Server Density is operational but in a degraded state. This means that whilst all systems are functioning, you may see reduced performance when browsing our web UI and using the APIs. This does not affect alert processing, but "no data received" alerts are suspended to avoid false positives, and the availability monitoring check frequency has been reduced from the usual 5 minutes to every 10 minutes.
Following a power failure at our primary data centre in Washington, USA, we have failed over to our secondary data centre in San Jose, USA. The power failure also caused a network outage, which meant we were unable to fail over quickly by repointing our IP to the backup data centre. This is a known failure condition, so we executed our alternative plan: switching our DNS to a backup IP (188.8.131.52) outside that facility.
Power is still out at the Washington data centre and we are now working on recovery steps. Additional capacity has been launched in our SJC data centre to help resolve the performance issues.
We'll continue to update this as we progress through this incident.
Jul 20, 13:28 BST
Our ops team have manually failed over to another load balancer and services are up at this time. Monitored servers were unable to post back between 11:03 and 11:49 UTC. We're continuing to monitor and will update this status post with an all-clear and a detailed RFO (reason for outage) once available.
If you hard-coded the Server Density IP into your hosts file, you'll need to update it to 184.108.40.206. Otherwise, servers will start posting back once the DNS change propagates globally (5-10 minutes).
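As a minimal sketch of that hosts-file update: the commands below rewrite a pinned entry from the original IP to the new one. The hostname `my.serverdensity.io` is illustrative (use whichever endpoint your agents actually post to), and the example works on a temporary copy so it is safe to run; apply the same `sed` to `/etc/hosts` with sudo for real.

```shell
# Work on a copy of a hosts file so this sketch is safe to run as-is.
cp_file=$(mktemp)
printf '127.0.0.1 localhost\n126.96.36.199 my.serverdensity.io\n' > "$cp_file"

# Replace the old pinned IP (126.96.36.199) with the new one (184.108.40.206),
# keeping a .bak backup of the original file.
sed -i.bak 's/^126\.96\.36\.199 /184.108.40.206 /' "$cp_file"

cat "$cp_file"
```

If you did not pin an IP, `dig +short my.serverdensity.io` (again, hostname illustrative) will show when the DNS change has reached your resolver.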
Jul 20, 12:59 BST
SoftLayer's WDC data centre has suffered a power failure in the room where the majority of our servers reside. This has also prevented us from failing over to our secondary data centre. On-site techs are working to restore power ASAP.
Jul 20, 12:39 BST
We're investigating problems with our load balancers which are rendering accounts and payload processing unavailable. During this time, monitored servers will be unable to post back to Server Density, which may result in "no data received" alerts.
Jul 20, 12:14 BST