Random gateway errors when accessing serverdensity.io
Incident Report for Server Density
Postmortem

At 13:04 UTC we started noticing an increased number of 502 Bad Gateway errors, which were affecting loading of the web UI for serverdensity.io accounts. This did not affect alerts processing, payload postbacks, service monitoring, the mobile apps or log storage.

Shortly before the errors were detected, we had deployed a change to production to enable a new LiveChat widget as part of our plan to expand and improve our customer support. Along with adding the widget to the UI, we had to make some changes to our Content Security Policy (CSP) headers. The CSP provides a whitelist of external resources that the UI is allowed to load. By default, no external resources are allowed unless specifically whitelisted. This is to prevent malicious code injection in the event of an XSS vulnerability in our app, and is one of the multiple layers of security we have.
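
For illustration, a whitelist of this kind could be assembled into a single header value roughly as follows. This is a minimal sketch: the directive names are standard CSP directives, but the host names, the `CSP_WHITELIST` structure and the `build_csp_header` helper are hypothetical and do not reflect the actual policy. It also shows why every new whitelisted resource makes the header longer.

```python
# Hypothetical whitelist: directive name -> allowed sources.
# The host names are placeholders, not the real production policy.
CSP_WHITELIST = {
    "default-src": ["'none'"],  # deny everything unless listed below
    "script-src": ["'self'", "https://chat-widget.example.com"],
    "img-src": ["'self'", "https://chat-widget.example.com"],
    "connect-src": ["'self'"],
}


def build_csp_header(whitelist):
    """Join each directive and its sources into one header value."""
    directives = []
    for name, sources in whitelist.items():
        directives.append(name + " " + " ".join(sources))
    return "; ".join(directives)


header_value = build_csp_header(CSP_WHITELIST)
print("Content-Security-Policy:", header_value)
print("Header value length in bytes:", len(header_value.encode("utf-8")))
```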

After some investigation, it turned out that for some browsers, the CSP header was too long for Nginx's header limit. This was causing Nginx to consider the backend server as down. We use Nginx as a load balancer, so as it marked each backend down in turn, it eventually exhausted the pool of available backends. This caused an intermittent 502 Bad Gateway error, which only lasted for a short time because the backends marked as down would quickly "recover" and become available again.
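
The failure mode described above can be sketched with a toy model. This is a deliberately simplified simulation, not Nginx's actual algorithm: the `Backend` class, the `handle_request` function, the 4096-byte limit and the 10-second recovery window are all assumptions chosen for illustration. It shows how a single oversized response header can take every backend out of the pool, so that unrelated requests also fail until the backends "recover".

```python
class Backend:
    """One upstream server as seen by the (simulated) load balancer."""

    def __init__(self, name):
        self.name = name
        self.down_until = 0.0  # balancer skips this backend before this time

    def is_available(self, now):
        return now >= self.down_until


def handle_request(pool, now, header_bytes, limit_bytes, fail_timeout):
    """Return the HTTP status for one request in this simplified model."""
    for backend in pool:
        if not backend.is_available(now):
            continue  # currently marked down, try the next backend
        if header_bytes > limit_bytes:
            # Oversized response header: mark the backend as down
            # and move on to the next one in the pool.
            backend.down_until = now + fail_timeout
            continue
        return 200
    # Every backend was already down or has just been marked down.
    return 502


pool = [Backend("web1"), Backend("web2"), Backend("web3")]
LIMIT = 4096         # assumed header limit in bytes
FAIL_TIMEOUT = 10.0  # assumed seconds before a backend is retried

# (time in seconds, response header size in bytes)
requests = [(0.0, 1000), (1.0, 5000), (2.0, 1000), (12.0, 1000)]
for now, size in requests:
    status = handle_request(pool, now, size, LIMIT, FAIL_TIMEOUT)
    print(f"t={now:>4}s header={size}B -> {status}")
```

In this model the request at t=1.0 marks all three backends as down, the normally sized request at t=2.0 fails as a side effect, and the request at t=12.0 succeeds again once the recovery window has passed.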

We had seen a similar problem in our staging environment for the same CSP change several months ago, so we were able to quickly identify this as the cause. For this change, none of our test browsers caused such large headers, which is why this was not picked up during testing.

We applied a manual fix directly to production at 13:18 UTC, which resolved the errors for all users.

A full code fix was then tested on one production load balancer at around 14:20 UTC, and the final version was released at 18:44 UTC. Total time for the intermittent errors was around 30 minutes.

As a result of this incident, we have improved our documentation around the CSP headers and removed some old, unused entries that had recently been disabled in production but not completely removed from the header. We will also be building a set of automated tests as part of our UI test suite to measure the header size and fail the build if it is too large.
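
A header size check of the kind described could look roughly like the sketch below. The `MAX_HEADER_BYTES` value and both function names are hypothetical; the real limit depends on how the Nginx buffers are configured, so the threshold here is only a placeholder.

```python
MAX_HEADER_BYTES = 4096  # assumed limit; must match the real proxy config


def header_line_size(name, value):
    """Size in bytes of one header line as sent on the wire."""
    return len(f"{name}: {value}\r\n".encode("utf-8"))


def check_header_size(name, value, limit=MAX_HEADER_BYTES):
    """Raise AssertionError (failing the build) if the header is too large."""
    size = header_line_size(name, value)
    if size > limit:
        raise AssertionError(
            f"{name} header is {size} bytes, over the {limit} byte limit"
        )
    return size


# Example with a short, made-up policy value.
size = check_header_size(
    "Content-Security-Policy",
    "default-src 'none'; script-src 'self'",
)
print("Header size OK:", size, "bytes")
```

Because the header was only too long for some browsers, a check like this would need to run against the header generated for each browser variant the UI supports, not just one.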

Posted Dec 01, 2014 - 10:56 GMT

Resolved
A permanent correction for this issue has now been released.
Posted Nov 20, 2014 - 18:44 GMT
Monitoring
We're still working on a permanent correction for this problem. Once done and when pushing it out, the gateway error may be displayed again for a short period of time.
Posted Nov 20, 2014 - 14:20 GMT
Identified
We've identified the cause of this error. A workaround has been put in place while we work on a permanent fix. The gateway error should not be displayed again.
Posted Nov 20, 2014 - 13:18 GMT
Investigating
We're randomly seeing a 'Bad gateway' error when accessing the .io accounts site. Investigation has started and we expect to have a correction soon. This does not affect device or service monitoring.
Posted Nov 20, 2014 - 13:04 GMT