On October 23 at 19:45 UTC, in preparation for announced maintenance across the provider data centers where we deploy, we initiated a controlled shutdown of one of our ZooKeeper and Kafka cluster members. This shutdown was required to move that member to another data center ahead of the announced provider maintenance. Although this operation had previously been tested successfully, it caused a loss of quorum on the cluster, shutting the cluster down and halting device and service payload processing at 19:50 UTC.
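The quorum loss follows from ZooKeeper's majority rule: an ensemble of n members keeps serving only while a strict majority of them is reachable, so removing one member from an ensemble that already has unhealthy nodes can drop it below that threshold. A minimal sketch of the arithmetic:

```python
def quorum(ensemble_size: int) -> int:
    """Minimum number of live members a ZooKeeper ensemble needs
    to maintain quorum (a strict majority)."""
    return ensemble_size // 2 + 1

for n in (3, 5):
    tolerated = n - quorum(n)
    print(f"{n}-node ensemble: quorum = {quorum(n)}, tolerates {tolerated} failure(s)")
```

A 3-node ensemble therefore tolerates only a single unavailable member; if a second member is unhealthy when one is deliberately shut down, the cluster stops.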
Recovery started immediately but initially focused on restoring that single cluster member, which significantly delayed addressing the root cause. Subsequent analysis uncovered data corruption on other cluster members and, given the duration of the incident, we decided to rebuild the cluster to restore service within a known time frame rather than attempt to recover from the existing errors. Because this cluster does not store any customer data, rebuilding it is a relatively simple operation. That work was completed, and payload processing and service were restored, at 21:38 UTC. At 23:09 UTC, payload processing was paused to allow reconfiguration of the cluster. At 23:41 UTC, that work was completed, ending the incident.
Up to this point we had relied on independent ZooKeeper and Kafka clusters deployed in geographically distant data centers, with each cluster confined to a single data center. This was mostly due to latency-related limitations within ZooKeeper and Kafka themselves. Following this event, and taking advantage of the <5ms round-trip latency to our provider's nearest neighboring data center, we will improve this architecture by deploying each cluster across multiple data centers. This change will provide better resilience and fully automated failover in similar circumstances.
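A deployment of this kind can be sketched as a ZooKeeper ensemble whose members sit in different facilities. The hostnames and data-center names below are placeholders for illustration, not our actual topology:

```
# zoo.cfg — illustrative 3-member ensemble spread across three data centers
# (hostnames are hypothetical examples)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.dc-a.example.net:2888:3888
server.2=zk2.dc-b.example.net:2888:3888
server.3=zk3.dc-c.example.net:2888:3888
```

With one member per facility, the loss of any single data center still leaves two members, which form a majority, so the ensemble continues serving without manual intervention.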