Payload processing stall
Incident Report for Server Density
Postmortem

On June 11 between 15:21 and 16:29 UTC we were unable to process incoming device payloads or web check results. This was the result of a global persistent storage failure at our provider that impacted both of our payload processing clusters, which are designed to provide redundancy for each other.

Our provider has released the following statement about this outage: "The issue arose from a dependency conflict in our current persistent disk implementation. This conflict will be fixed in the next iteration, though we cannot yet share a timeline for its release. The next implementation is more fault tolerant and reliable, to prevent issues like this in the future, and we have monitoring in place to reduce the risk of this affecting customers whilst we work on the permanent fix."

Posted Jun 28, 2018 - 21:17 BST

Resolved
We have completed our post-incident verification and confirmed that we had fully recovered as of 17:11 UTC.
Posted Jun 11, 2018 - 19:28 BST
Monitoring
We have been able to restore service to the second payload processing cluster and are processing payloads in real time again.
We are now monitoring the system for full recovery.
Posted Jun 11, 2018 - 17:53 BST
Update
We have been able to restore one of the payload processing clusters. This partially restores service: some data is being graphed again, and alert processing should be back momentarily.
Posted Jun 11, 2018 - 17:41 BST
Identified
We have identified a storage issue impacting both of our payload processing clusters and are working to recover.
Posted Jun 11, 2018 - 17:05 BST
Investigating
We're currently investigating a failure to process device payloads. Alerting is also impacted.
Posted Jun 11, 2018 - 16:29 BST
This incident affected: Alerting, Agent payloads, and Availability monitoring.