On January 27 at 12:55 UTC, in preparation for an upgrade procedure on our ZooKeeper and Kafka clusters, we initiated a controlled shutdown of one of the cluster members. Although we had performed this operation successfully several times as part of our scheduled maintenance runs, in this case it caused a loss of quorum on the cluster. This led to the cluster shutting down and, at 13:05 UTC, a consequent halt of device and services payload processing, which also manifested as a high error rate on UI and API requests.
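For context, ZooKeeper can only serve requests while a strict majority of the ensemble is reachable. The sketch below (in Python, with a hypothetical 3-node ensemble, since the actual size of our ensemble is not stated here) shows how thin that margin is: quorum in a 3-node ensemble is 2, so taking one member down for maintenance consumes the entire failure budget.

```python
def quorum(ensemble_size):
    """Minimum number of live nodes ZooKeeper needs to keep serving:
    a strict majority of the ensemble."""
    return ensemble_size // 2 + 1

# Hypothetical 3-node ensemble: quorum is 2, so exactly one failure is
# tolerated. A controlled shutdown of one member uses up that margin;
# any other unhealthy node at the same time costs quorum.
ensemble = 3
print(quorum(ensemble))             # 2
print(ensemble - quorum(ensemble))  # 1 -> failures tolerated
```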
Recovery work started immediately on the entire cluster, and service was restored at 13:50 UTC.
In a previous incident (http://status.serverdensity.com/incidents/w54vkg6zfdhd) we learnt that we had to geographically distribute these clusters, which we did. Those clusters have since been running from two different data centres in the same region, to stay within ZooKeeper's latency requirements. However, this incident shows that this is still not enough. We will expand this to multiple independent clusters across multiple data centres. This will also allow us to place those clusters further apart geographically, since latency matters less when each cluster functions independently.
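The limitation of a two-DC layout follows directly from the quorum rule above: however an ensemble is split across two sites, one site necessarily holds the majority, so losing or isolating that site loses quorum. A sketch, using hypothetical node placements:

```python
def survives_site_loss(placement):
    """Given a dict of site -> node count, return True only if quorum
    survives the loss of *any* single site."""
    total = sum(placement.values())
    q = total // 2 + 1
    return all(total - nodes >= q for nodes in placement.values())

# Two DCs: whichever way you split 5 nodes, one DC holds the majority,
# and losing it drops the ensemble below quorum.
print(survives_site_loss({"dc1": 3, "dc2": 2}))             # False
# Three DCs: losing any single site still leaves a majority alive.
print(survives_site_loss({"dc1": 2, "dc2": 2, "dc3": 1}))   # True
```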
In short, we need to extend our N+1 architecture (one more of everything) to these two functions. This will allow us to create another layer of redundancy in this core part of our infrastructure - a cluster of clusters - minimizing service loss due to ZooKeeper quorum loss.
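To illustrate the "cluster of clusters" idea, here is a minimal sketch of client-side failover between two fully independent Kafka clusters, using the kafka-python client. The bootstrap addresses and topic name are hypothetical, and real failover would need retries, health checks, and deduplication; this only shows the shape of the redundancy, where the standby cluster shares no ZooKeeper state with the primary:

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

# Hypothetical bootstrap lists for two independent clusters in separate
# data centres; a quorum loss in one cannot affect the other.
CLUSTERS = [
    ["kafka-dc1-a:9092", "kafka-dc1-b:9092"],  # primary cluster
    ["kafka-dc2-a:9092", "kafka-dc2-b:9092"],  # standby cluster
]

def send_with_failover(topic, payload):
    """Try each independent cluster in turn until one accepts the write."""
    for bootstrap in CLUSTERS:
        try:
            producer = KafkaProducer(bootstrap_servers=bootstrap)
            producer.send(topic, payload).get(timeout=10)  # block until acked
            producer.close()
            return bootstrap
        except KafkaError:
            continue  # cluster unavailable (e.g. quorum loss); try the next
    raise RuntimeError("all clusters unavailable")

send_with_failover("payloads", b"device metrics")
```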