Notifications processing stuck
Incident Report for Server Density
Postmortem

This is a longer postmortem than usual so we can describe some of the design of Server Density and where the multiple failures leading to this incident happened. We also consider this incident, and the one on the 6th with similar consequences (https://status.serverdensity.com/incidents/594sk8db19rx), extremely severe, and since then we have been actively working on improvements both to detect any recurrence immediately and to prevent it from recurring entirely.

Server Density is built on a microservices architecture, where each service plays one role and one role only. When a service needs to perform a task it does not implement, it invokes another service that carries out the task and returns the result to the caller.

Alerting is mostly handled by two separate services: one dedicated to the alerting rules engine, processing events and managing triggered alerts; and a second service for sending notifications to various 3rd party providers such as email, SMS and PagerDuty.
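
To make the split concrete, here is a minimal sketch of the idea, assuming hypothetical names and an in-process queue standing in for the real message broker; it is not the actual Server Density code:

```python
# Illustrative sketch only: hypothetical names, with an in-process queue
# standing in for the real message broker.
import json
import queue

notification_queue = queue.Queue()

def alerting_rules_engine(event, rules):
    """Evaluate an incoming event against alert rules and enqueue notifications."""
    for rule in rules:
        if event["value"] > rule["threshold"]:
            # The rules engine never contacts a provider itself; it only hands
            # a notification job to the notification service via the broker.
            notification_queue.put(json.dumps({
                "alert": rule["name"],
                "channels": rule["channels"],  # e.g. email, SMS, PagerDuty
            }))

def notification_service(deliver):
    """Consume queued notification jobs and dispatch each one to its providers."""
    while not notification_queue.empty():
        job = json.loads(notification_queue.get())
        for channel in job["channels"]:
            deliver(channel, job["alert"])  # deliver() is a hypothetical provider call
```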

The incident on Feb 6 was caused by the notification service stalling when trying to reach its message broker after a brief network interruption. The service did not correctly handle the reconnect and so failed to complete the process of sending notifications. This is a software bug for which a fix is in progress.
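
The fix in progress amounts to treating a dropped broker connection as a retriable condition instead of stalling. A hedged sketch of that pattern, with connect() and consume() as hypothetical stand-ins for the real broker client calls:

```python
# Illustrative sketch only: connect() and consume() are hypothetical stand-ins
# for the real broker client calls.
import time

def run_consumer(connect, consume, max_backoff=60):
    """Keep consuming from the broker, reconnecting after network interruptions."""
    backoff = 1
    while True:
        try:
            connection = connect()   # may raise if the network blips
            backoff = 1              # reset once we are connected again
            consume(connection)      # blocks until the connection drops
        except ConnectionError:
            # Rather than stalling (the Feb 6 failure mode), wait and retry
            # with exponential backoff so queued notifications resume flowing.
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```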

The incident on Feb 18 was caused by the failure of several connected systems. Our billing service relies on a database cluster which experienced a failover triggered by an abnormal memory allocation. As a result, the billing system was in a failing state. Other components of Server Density have a failsafe mode so that if calls to the billing system fail, they continue processing regardless. However, this incident surfaced a previously unknown software bug in the notification service where this failsafe did not work. As such, the failed call to the billing service caused the notification service to halt processing, so no notifications were sent.
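
The failsafe described above is a fail-open check: if the billing lookup errors, assume the account is active and carry on. A minimal sketch of that idea, using a hypothetical billing_client interface rather than the real service code:

```python
# Illustrative sketch only: billing_client is a hypothetical interface,
# not the real billing service client.
import logging

def should_send_notification(billing_client, account_id):
    """Check with the billing service whether the account is active.

    Fails open: if the billing service errors or is unreachable, assume the
    account is active so notifications keep flowing. The Feb 18 bug was, in
    effect, this failsafe not being applied in one notification code path.
    """
    try:
        return billing_client.is_active(account_id)
    except Exception:
        logging.exception("billing check failed; failing open")
        return True
```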

A database failover is not a critical alert because the system recovers by itself. However, the billing system being unavailable is a critical state and we have alerting configured to detect it. We use both Server Density itself and a 3rd party, competitor monitoring service. Because the notification service had also failed, alerts from Server Density itself were not sent. This is why we have a 3rd party service configured to detect downtime independently. However, that alert did not trigger either: the database is only accessed after the initial authentication step, which is the point the external test hits, so the check passed even though the billing system was down. As such, our on-call team were unaware that the billing system was down. This is a design flaw in the status test which we have prioritised to fix.
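
The fix we have prioritised is for the externally monitored check to exercise every dependency, including the billing database, rather than stopping at authentication. A rough sketch of that kind of health check, with the individual probes left as hypothetical callables:

```python
# Illustrative sketch only: the probes are hypothetical callables that raise
# an exception when their dependency is unhealthy.
def health_check(probes):
    """Run every dependency probe rather than stopping at authentication.

    `probes` maps a dependency name to a callable, for example
    {"auth": check_auth, "billing_db": check_billing_db}. An external monitor
    hitting an endpoint backed by this check would have seen the billing
    database outage that the Feb 18 status test missed.
    """
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = "failed: %s" % exc
    return all(status == "ok" for status in results.values()), results
```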

Separately, we have end to end tests built into Server Density. These inject an agent payload into our agent intake, tracking the payload through the system with a series of integration tests to ensure that the expected behaviours occur. We test a range of conditions to provide diversity across code paths, ending in testing the notification service itself. The idea is to ensure that alerts get triggered and notifications can be sent.
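
As a rough illustration of what such a test looks like, the sketch below injects a synthetic payload and waits for a notification job to appear; the client, intake endpoint and queue inspection helper are assumptions, not our actual test harness:

```python
# Illustrative sketch only: client, the intake endpoint and the queue
# inspection helper are assumptions, not the real test harness.
import time

def test_payload_triggers_notification(client, notification_queue):
    """Inject a synthetic agent payload and follow it through the pipeline."""
    payload = {"hostname": "e2e-test-host", "cpu": 99.0}  # breaches a test alert rule
    client.post("/intake", json=payload)

    # Poll until the alerting pipeline produces a notification job for the
    # test host, or fail after a short deadline.
    deadline = time.time() + 30
    while time.time() < deadline:
        job = notification_queue.find(hostname="e2e-test-host")  # hypothetical helper
        if job is not None:
            return  # the payload produced a queued notification as expected
        time.sleep(1)
    raise AssertionError("payload was ingested but no notification was queued")
```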

Due to limitations at the time the test suite was originally written, we did not implement a full step by step test of the notification service. Testing every single notification provider to confirm successful delivery to the “end point”, e.g. an email address, would take too long: the test would time out when executed by our own or 3rd party monitoring systems. Internal discussion led to an implementation which tested the provider APIs to confirm that a sample notification could be delivered, but did not track the notification that was actually injected. This meant the end to end tests were end to end except for the final step, which was really an integration test against the provider API. This was considered technical debt, but it was not properly documented, so the severity of the limitation was not fully understood by the management team and sufficient priority was not given to fixing the implementation.
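
The limitation can be seen in a sketch of that old final step: it proves the provider API can deliver a sample notification, but never confirms that the injected notification itself made it through the notification service. The provider_api client here is hypothetical:

```python
# Illustrative sketch only: provider_api is a hypothetical client object.
def smoke_test_provider(provider_api):
    """The previous final step: prove the provider API can deliver *a* notification.

    This confirms the integration with the provider is healthy, but says
    nothing about whether the specific notification injected by the end to
    end test ever left our own notification service.
    """
    response = provider_api.send(to="test-recipient", body="sample notification")
    assert response.status == "accepted"
```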

This was reviewed following the incident on Feb 6: a new approach was designed and a task created in our latest development cycle. The task was prioritised and work started on the new implementation on Feb 15. Unfortunately we did not have enough time to complete the development work before this latest incident.

Both incidents had separate root causes, but the reason they lasted so long was a failure in the alerting setup across two separate providers. There was also a failure to properly understand and prioritise technical debt, and when the limitations were discovered we were unable to implement a fix quickly enough to prevent a second failure.

Not delivering alerts is the worst type of system failure we plan for at Server Density. This is why we have multiple, separate monitoring systems to detect it. The failure of those systems compounded the severity of these outages because when our system fails, it is critical that we’re aware of it, can communicate the failure to customers and begin our incident response rapidly.

We’re very sorry about this series of events and hope that this explains what happened and what we have done to prevent it from happening again. Please email hello@serverdensity.com if you have any questions.

Posted Feb 21, 2017 - 17:48 GMT

Resolved
Today, between 00:11 and 15:17 UTC, notifications generated from alerts were queued but not sent. At this time the notification queue has been consumed with all pending notifications delivered. We will be publishing a postmortem of this incident on Monday with details of the work already ongoing to prevent a reoccurrence.
Posted Feb 18, 2017 - 16:05 GMT