At 13:04 UTC we started seeing an elevated number of 502 Bad Gateway errors, which affected loading of the web UI for serverdensity.io accounts. Alerts processing, payload postbacks, service monitoring, the mobile apps and log storage were not affected.
Shortly before the errors were detected, we had deployed a change to production to enable a new LiveChat widget as part of our plan to expand and improve our customer support. Along with adding the widget to the UI, we had to change our Content Security Policy (CSP) headers. The CSP provides a whitelist of external resources that the UI is allowed to load; by default, no external resource is allowed unless it is specifically whitelisted. This is to prevent malicious code injection in the event of an XSS vulnerability in our app, and is one of the multiple layers of security we have.
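To illustrate how such a whitelist turns into a single header value, here is a minimal Python sketch. The directive names are standard CSP directives, but the hosts and the `build_csp` helper are hypothetical placeholders, not our actual policy or code. Note how every newly whitelisted host makes the header longer.

```python
# Hypothetical policy: the directive names are real CSP directives,
# but the hosts are placeholders, not the actual production whitelist.
POLICY = {
    "default-src": ["'none'"],
    "script-src": ["'self'", "https://chat.example.com"],
    "img-src": ["'self'", "https://cdn.example.com"],
}


def build_csp(policy):
    """Join each directive and its allowed sources into one header value."""
    parts = []
    for directive, sources in policy.items():
        parts.append(directive + " " + " ".join(sources))
    return "; ".join(parts)


csp_value = build_csp(POLICY)
print("Content-Security-Policy:", csp_value)
```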
After some investigation, it turned out that for some browsers the CSP header was longer than Nginx's header limit, which caused Nginx to consider the backend server down. We use Nginx as a load balancer, so as it marked each backend down in turn, it eventually exhausted the pool of available backends. The result was an intermittent 502 Bad Gateway error that lasted only a short time, because the backends marked as bad would quickly "recover" and become available again.
We had seen a similar problem in our staging environment with the same CSP change several months ago, so we were able to quickly identify this as the cause. For this change, none of our test browsers produced headers that large, which is why the problem was not picked up during testing.
We applied a manual fix directly to production at 13:18 UTC, which resolved the errors for all users.
A full code fix was then tested on one production load balancer at around 14:20, and the final version was released at 18:44 UTC. The intermittent errors lasted around 30 minutes in total.
As a result of this incident, we have improved our documentation around the CSP headers and removed some old, unused entries that had recently been disabled in production but not completely removed from the header. We will also be adding a set of automated tests to our UI test suite that measure the header size and fail the build if it is too large.
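A size check of this kind could look roughly like the following Python sketch. The 4096 byte budget, the helper names and the sample headers are assumptions for illustration only; the real limit depends on how the load balancer's buffers are configured, and the real test would read the header from an actual response.

```python
# Assumed budget: the real limit depends on how the proxy's buffers are
# configured, so 4096 bytes here is a placeholder, not a measured value.
MAX_HEADER_BYTES = 4096


def header_line_size(name, value):
    """Size in bytes of the header line, including the trailing CRLF."""
    return len(f"{name}: {value}\r\n".encode("utf-8"))


def oversized_headers(headers, limit=MAX_HEADER_BYTES):
    """Return the names of headers whose wire size exceeds the limit."""
    return [n for n, v in headers.items() if header_line_size(n, v) > limit]


# Hypothetical sample data: a short policy and an artificially long one.
small = {"Content-Security-Policy": "default-src 'none'"}
large = {"Content-Security-Policy": "script-src " + "https://a.example.com " * 300}

print(oversized_headers(small))
print(oversized_headers(large))
```

A build step could call `oversized_headers` on the headers the UI emits and fail whenever the returned list is not empty.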