This performance degradation was caused by a bug in the routing component we use to handle traffic through our web UI. The web UI is effectively an API client on top of our internal APIs, which are also exposed through our public API. The routing component simply adds authentication.
During this incident, the routing layer exhausted the resources available to it and maxed out the CPU on the server it was running on. Because our monitoring had no visibility into the CPU utilisation of the routing process specifically, this was not caught. A limitation in the implementation also meant that the routing component was bound to a single core.
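The monitoring gap here is that host-level CPU graphs can look healthy while one single-threaded process saturates its core. As a purely illustrative sketch (not our actual monitoring code), per-process CPU share can be derived on Linux by sampling the `utime` and `stime` tick counters in `/proc/<pid>/stat`:

```python
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second

def process_cpu_ticks(pid):
    """Return total CPU ticks (user + system) consumed by a process."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the ")" closing the comm field, which may contain spaces.
        fields = f.read().rsplit(")", 1)[1].split()
    # Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat,
    # which are indices 11 and 12 after the comm field is stripped.
    return int(fields[11]) + int(fields[12])

def cpu_percent(ticks_before, ticks_after, interval_s):
    """CPU share (of one core) used by the process over the interval.

    100.0 means the process fully saturated a single core, which is
    exactly the condition that host-wide averages can hide.
    """
    return 100.0 * (ticks_after - ticks_before) / (interval_s * CLK_TCK)
```

Sampling `process_cpu_ticks` for the routing process at a fixed interval and alerting when `cpu_percent` approaches 100 would surface this failure mode directly.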
The UI uses a separate load balancer from the monitoring agent postback handling, so alerting and monitoring were unaffected - this incident was isolated to the web UI.
The immediate resolution was to switch to one of our hot standby load balancers, which solved the problem in the short term. In the medium term, we have alleviated the resource contention by giving the routing component more CPU power. We have also improved the monitoring on the component, so we now have detailed statistics about its resource usage.
Since then, we have been working on improvements to the component to allow it to use multiple cores. As anyone who has worked with multi-core processing and/or threading is aware, adding multi-core capabilities is non-trivial. By design, we have a very simple, stateless load balancing setup, so we are working to ensure we can keep our load balancers stateless even as requests are directed across multiple cores. This will let us take advantage of more CPU resources and then scale out across multiple stateless load balancers, instead of having to use something like DNS round robin load balancing, which is neither as intelligent nor as flexible.
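One common kernel-level way to achieve this on Linux - offered here as an illustrative sketch of the general technique, not as our implementation - is the `SO_REUSEPORT` socket option: several worker processes, one per core, each bind the same port, and the kernel distributes incoming connections between them, so no state needs to be shared between workers:

```python
import socket

def make_worker_socket(port):
    """Create a listening socket that shares its port with sibling workers.

    With SO_REUSEPORT set before bind(), multiple processes can listen on
    the same address and port; the kernel balances new connections across
    them. Each worker stays independent and stateless, so one worker can
    be started per core.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(128)
    return sock
```

Because each worker accepts connections independently, the layer in front of them can remain a plain stateless load balancer - no session affinity or shared connection state is required.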
We expect this work to be finished within the next week, well before we would need to fall back on faster CPUs and/or DNS round robin load balancing as an interim solution.