On January 27 at 12:55 UTC, in preparation for an upgrade procedure on our ZooKeeper and Kafka clusters, we initiated a controlled shutdown of one of the cluster members. Although we had performed this operation successfully several times as part of our scheduled maintenance runs, in this case it caused a loss of quorum on the cluster. This led to the cluster shutting down and, at 13:05 UTC, a consequent halt of device and services payload processing, which also manifested as a high error rate on UI and API requests.
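For context, ZooKeeper can only serve requests while a strict majority of the ensemble is reachable. The sketch below (in Python, with a hypothetical 3-node ensemble, since the actual size of our ensemble is not stated here) shows how thin that margin is: quorum in a 3-node ensemble is 2, so taking one member down for maintenance consumes the entire failure budget.

```python
def quorum(ensemble_size):
    """Minimum number of live nodes ZooKeeper needs to keep serving:
    a strict majority of the ensemble."""
    return ensemble_size // 2 + 1

# Hypothetical 3-node ensemble: quorum is 2, so exactly one failure is
# tolerated. A controlled shutdown of one member uses up that margin;
# any other unhealthy node at the same time costs quorum.
ensemble = 3
print(quorum(ensemble))             # 2
print(ensemble - quorum(ensemble))  # 1 -> failures tolerated
```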
Recovery work started immediately on the entire cluster, and service was restored at 13:50 UTC.
In a previous incident (http://status.serverdensity.com/incidents/w54vkg6zfdhd) we learnt that we had to geographically distribute these clusters, which we did. Those clusters have since been running from two different data centres in the same region, to stay within ZooKeeper's latency requirements. However, this incident shows that this is still not enough. We will expand this to multiple independent clusters across multiple data centres. This will also allow us to place those clusters further apart geographically, since latency matters less when each cluster functions independently.
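The limitation of a two-DC layout follows directly from the quorum rule above: however an ensemble is split across two sites, one site necessarily holds the majority, so losing or isolating that site loses quorum. A sketch, using hypothetical node placements:

```python
def survives_site_loss(placement):
    """Given a dict of site -> node count, return True only if quorum
    survives the loss of *any* single site."""
    total = sum(placement.values())
    q = total // 2 + 1
    return all(total - nodes >= q for nodes in placement.values())

# Two DCs: whichever way you split 5 nodes, one DC holds the majority,
# and losing it drops the ensemble below quorum.
print(survives_site_loss({"dc1": 3, "dc2": 2}))             # False
# Three DCs: losing any single site still leaves a majority alive.
print(survives_site_loss({"dc1": 2, "dc2": 2, "dc3": 1}))   # True
```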
In short, we need to extend our N+1 architecture (one more of everything) to these two functions. This will allow us to create another layer of redundancy in this core part of our infrastructure - a cluster of clusters - minimizing service loss due to ZooKeeper quorum loss.
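To illustrate the "cluster of clusters" idea, here is a minimal sketch of client-side failover between two fully independent Kafka clusters, using the kafka-python client. The bootstrap addresses and topic name are hypothetical, and real failover would need retries, health checks, and deduplication; this only shows the shape of the redundancy, where the standby cluster shares no ZooKeeper state with the primary:

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

# Hypothetical bootstrap lists for two independent clusters in separate
# data centres; a quorum loss in one cannot affect the other.
CLUSTERS = [
    ["kafka-dc1-a:9092", "kafka-dc1-b:9092"],  # primary cluster
    ["kafka-dc2-a:9092", "kafka-dc2-b:9092"],  # standby cluster
]

def send_with_failover(topic, payload):
    """Try each independent cluster in turn until one accepts the write."""
    for bootstrap in CLUSTERS:
        try:
            producer = KafkaProducer(bootstrap_servers=bootstrap)
            producer.send(topic, payload).get(timeout=10)  # block until acked
            producer.close()
            return bootstrap
        except KafkaError:
            continue  # cluster unavailable (e.g. quorum loss); try the next
    raise RuntimeError("all clusters unavailable")

send_with_failover("payloads", b"device metrics")
```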