Payload processing errors
Incident Report for Server Density
Postmortem

On October 23 at 19:45 UTC, in preparation for an announced maintenance operation across all of the provider data centers we deploy to, we initiated a controlled shutdown of one of our ZooKeeper and Kafka cluster members. This shutdown was required to move that member to another data center ahead of the announced provider maintenance. Although this operation had previously been tested successfully, it caused a loss of quorum on the cluster, leading to the cluster shutting down and, at 19:50 UTC, a halt of device and services payload processing.
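
For context, ZooKeeper-style clusters stay available only while a strict majority of members can reach each other. The sketch below illustrates that arithmetic; the ensemble sizes are hypothetical, since this report does not state how many members the cluster had.

    # Illustrative only: quorum arithmetic for a ZooKeeper-style ensemble.
    # Ensemble sizes here are hypothetical; the report does not state the real size.
    def majority(ensemble_size):
        """Smallest number of members that still forms a quorum."""
        return ensemble_size // 2 + 1

    def tolerated_failures(ensemble_size):
        """Members that can be lost before quorum is lost."""
        return ensemble_size - majority(ensemble_size)

    for size in (3, 5):
        print(f"{size}-node ensemble: quorum={majority(size)}, "
              f"tolerates {tolerated_failures(size)} failure(s)")
    # A 3-node ensemble tolerates 1 failure and a 5-node ensemble tolerates 2,
    # so a single planned shutdown should normally be survivable unless
    # another member is already unhealthy.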

Recovery action started immediately but initially focused on recovering that single cluster member, which significantly delayed dealing with the root cause of the problem. Subsequent analysis uncovered data corruption on other cluster members and, given the duration of the incident, we decided to rebuild the cluster to restore service within a known time frame rather than trying to recover from the existing errors. As this cluster does not store any customer data, rebuilding it is a relatively simple operation. That work was completed and payload processing and service were restored at 21:38 UTC. At 23:09 UTC, payload processing was paused to allow reconfiguration of that cluster. At 23:41 UTC, that work was completed, ending the incident.

Up to this point we had relied on independent ZooKeeper and Kafka clusters deployed in geographically distant data centers, with each cluster confined to a single data center. This was mostly due to latency-related limitations within ZooKeeper and Kafka themselves. However, following this event, and taking advantage of the <5ms latency between neighbouring data centers of our provider, we will be improving this architecture by deploying each cluster across multiple data centers. This change will allow better resilience and fully automated failover in similar circumstances.
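
To illustrate why stretching a cluster across nearby data centers helps, the sketch below checks quorum survival for a hypothetical placement of members across sites. The member counts and data center names are assumptions for illustration only, not a description of our actual deployment.

    # Illustrative only: a hypothetical placement of a 5-member ensemble
    # across 3 nearby data centers (<5ms apart). The real layout may differ.
    placement = {"dc1": 2, "dc2": 2, "dc3": 1}

    total = sum(placement.values())
    quorum = total // 2 + 1

    for lost_dc, members in placement.items():
        remaining = total - members
        status = "quorum kept" if remaining >= quorum else "quorum LOST"
        print(f"lose {lost_dc}: {remaining}/{total} members remain -> {status}")
    # With all 5 members in a single data center, losing that site leaves 0
    # members; spreading them keeps a majority through any single-site outage,
    # which is what allows fully automated failover.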

Posted Oct 27, 2015 - 11:39 GMT

Resolved
We have now completed the maintenance procedures that will allow Server Density to perform as normal over the coming days, when some of our servers will be rebooted by our provider as part of an ongoing Xen maintenance.

A post mortem of this complete operation will be published in the next few days.
Posted Oct 24, 2015 - 01:16 BST
Monitoring
The affected cluster has now been recovered and request processing is back to normal. We will be monitoring the situation closely, particularly because the trigger identified so far has been our maintenance procedures in preparation for our provider's Xen reboots over the next few days. These procedures have been paused for the moment but will resume shortly.
Posted Oct 23, 2015 - 23:00 BST
Identified
We are currently experiencing a failure in processing requests. This has been caused by the loss of quorum on one of our processing clusters, and we are working to recover from that error.
Posted Oct 23, 2015 - 21:43 BST