This performance degradation was caused by a bug in the routing component we use to handle traffic through our web UI. The web UI is effectively an API client on top of our internal APIs, which are also exposed through our public API. The routing component simply adds authentication.
During this incident, the routing layer exhausted the resources available to it and maxed out the CPU on the server it was running on. Because our monitoring had no visibility into the CPU utilisation of the routing process specifically, this was not caught. A limitation in the implementation also meant that the routing component was bound to a single core.
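The monitoring gap here is that host-level CPU graphs can look healthy while one single-threaded process saturates its core. As a purely illustrative sketch (not our actual monitoring code), per-process CPU share can be derived on Linux by sampling the `utime` and `stime` tick counters in `/proc/<pid>/stat`:

```python
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second

def process_cpu_ticks(pid):
    """Return total CPU ticks (user + system) consumed by a process."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the ")" closing the comm field, which may contain spaces.
        fields = f.read().rsplit(")", 1)[1].split()
    # Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat,
    # which are indices 11 and 12 after the comm field is stripped.
    return int(fields[11]) + int(fields[12])

def cpu_percent(ticks_before, ticks_after, interval_s):
    """CPU share (of one core) used by the process over the interval.

    100.0 means the process fully saturated a single core, which is
    exactly the condition that host-wide averages can hide.
    """
    return 100.0 * (ticks_after - ticks_before) / (interval_s * CLK_TCK)
```

Sampling `process_cpu_ticks` for the routing process at a fixed interval and alerting when `cpu_percent` approaches 100 would surface this failure mode directly.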
The UI uses a separate load balancer from the monitoring agent postback handling, so alerting and monitoring were unaffected - this incident was isolated to the web UI.
The immediate resolution was to switch to one of our hot standby load balancers, which solved the problem in the short term. In the medium term, we have alleviated the resource contention by giving the routing component more CPU power. We have also improved the monitoring on the component, so we now have detailed statistics about its resource usage.
Since then, we have been working on improvements to the component to allow it to use multiple cores. As anyone who has worked with multi-core processing and/or threading is aware, adding multi-core capabilities is non-trivial. By design, we have a very simple, stateless load balancing setup, so we are working to ensure we can keep our load balancers stateless even as requests are directed across multiple cores. This will let us take advantage of more CPU resources and then scale out across multiple stateless load balancers, instead of having to use something like DNS round robin load balancing, which is neither as intelligent nor as flexible.
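One common kernel-level way to achieve this on Linux - offered here as an illustrative sketch of the general technique, not as our implementation - is the `SO_REUSEPORT` socket option: several worker processes, one per core, each bind the same port, and the kernel distributes incoming connections between them, so no state needs to be shared between workers:

```python
import socket

def make_worker_socket(port):
    """Create a listening socket that shares its port with sibling workers.

    With SO_REUSEPORT set before bind(), multiple processes can listen on
    the same address and port; the kernel balances new connections across
    them. Each worker stays independent and stateless, so one worker can
    be started per core.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(128)
    return sock
```

Because each worker accepts connections independently, the layer in front of them can remain a plain stateless load balancer - no session affinity or shared connection state is required.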
We expect this work to be finished within the next week, well before we would need to fall back on faster CPUs and/or DNS round robin load balancing as an interim solution.