Service unavailability
Incident Report for Server Density
Postmortem

At 18:40 UTC a code change was pushed to our upstream Puppet Nginx module to apply a number of Puppet Lint fixes. This was part of work to tidy up the module and ensure we were using proper types, as required by an internal change to some logging functionality.

This change broke functionality elsewhere in another Puppet module that was relying on the old, incorrect types. This was an unexpected regression and was not picked up during code review.
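
As a rough illustration of the failure mode (a minimal sketch only; the class and parameter names below are assumptions, not the actual module code), a typical Puppet Lint cleanup converts quoted booleans such as 'true' into bare booleans. Any other module still comparing against the old string value then silently stops matching:

    # Hypothetical sketch only; names and values are illustrative.
    # After the lint cleanup the parameter is a bare boolean rather than
    # the quoted string 'true' it used to be.
    class nginx (
      $ssl_enabled = true
    ) {
      # ... nginx configuration resources ...
    }

    # A dependent module still comparing against the old string value never
    # matches once the parameter is a real boolean, so the resources it
    # guards are silently skipped.
    class loadbalancer {
      if $nginx::ssl_enabled == 'true' {
        notify { 'configuring SSL on the load balancer': }
      }
    }

    include nginx
    include loadbalancer

Because the old comparison fails silently rather than raising an error, this class of regression is hard to catch without proper tests.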

We automatically deploy the master branch of our Puppet modules every 10 minutes, and as a result the new code was deployed to the Puppet Master at 18:50 UTC. Shortly afterwards, the regular Puppet agent run deployed the changes to our standby load balancer, which caused it to fail. Puppet agents run every 30 minutes at staggered intervals, and the run on the primary load balancer picked up the change at 19:01 UTC.
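
For context, a deployment loop of this kind can be expressed in Puppet itself. The sketch below is an assumption about the general shape of such a setup (the paths, repository layout and resource titles are illustrative, and ini_setting comes from the puppetlabs-inifile module), not our actual configuration:

    # Hypothetical sketch; paths and titles are illustrative.
    # On the Puppet Master: pull the master branch of the module repository
    # every 10 minutes so that merged changes go live automatically.
    cron { 'deploy-puppet-modules':
      command => 'cd /etc/puppet/modules && git pull --quiet origin master',
      user    => 'root',
      minute  => '*/10',
    }

    # On every agent: run every 30 minutes (1800 seconds) with splay enabled,
    # which staggers start times so hosts do not all apply a change at once.
    ini_setting { 'agent runinterval':
      path    => '/etc/puppet/puppet.conf',
      section => 'agent',
      setting => 'runinterval',
      value   => '1800',
    }

    ini_setting { 'agent splay':
      path    => '/etc/puppet/puppet.conf',
      section => 'agent',
      setting => 'splay',
      value   => 'true',
    }

The important property for this incident is that the window between a merge to master and the change reaching every host is at most around 40 minutes, with no manual gate in between.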

The primary load balancer immediately failed. Our failover system detected this and automatically failed over to the standby load balancer. However, since the standby had already failed, service was lost.

Our response started at 19:03 UTC and we were able to revert the failover and restart the primary load balancer at 19:04 UTC. However, due to the configuration error introduced by the Puppet module change, the failover process was triggered again and our public IP was re-routed to the still-failed standby load balancer at 19:06 UTC.

After more detailed investigation, we diagnosed the error in the Puppet module at 19:15 UTC and pushed a code fix at 19:16 UTC. The fix was fully deployed and service was restored at 19:19 UTC.

The root cause of this outage was a bad code change that was pushed to production.

We are nearing the end of a project to rework most of our Puppet modules to use official modules where possible, backed by a full test/staging environment for all changes. We are still running several legacy modules which are difficult to test properly, even with the normal code review process we go through; these are being retired and replaced with the new Puppet setup. Completing that work will allow us to eliminate this kind of error in the future.

Posted Oct 16, 2014 - 23:35 BST

Resolved
This has now been resolved. A full post mortem will be provided shortly.
Posted Oct 16, 2014 - 23:17 BST
Identified
The problem has been resolved and services are back online. We're running through our post-outage checklist now.
Posted Oct 16, 2014 - 20:20 BST
Investigating
We are currently experiencing an issue with both of our load balancers. Server Density v2 is currently unavailable.
Posted Oct 16, 2014 - 20:16 BST