High postback error rate

Incident Report for Server Density

Postmortem

On April 19, between 11:00 and 11:17 UTC, following a routing issue with the public network connectivity at our primary data center, we failed to receive server monitoring agent device payloads, which triggered "no data received" alerts for affected customers.

Our initial response at 11:11 UTC was to manually disable "no data received" alerting for all customers, after which we escalated the issue to our networking provider, who informed us of known networking problems. We continued to observe networking failures for several hours after our provider marked the routing issue as resolved, and re-enabled "no data received" alerting at 14:29 UTC.

This incident had two facets:

1. Public internet connectivity failure. This meant that server monitoring agent payloads were unable to reach our agent endpoints over the public internet. Public network connectivity became degraded beginning at approximately 19-Apr-2017 11:00 UTC. An upstream provider began advertising additional routes to the network that are not normally learned from them. Due to the routing policy, these routes were preferred over the existing ones, which caused traffic to shift to this provider. The sudden increase in traffic caused the links to them to become congested. At approximately 19-Apr-2017 11:40 UTC the provider corrected their announcements, which normalized network connectivity. The networking provider will audit all peering links to ensure that routing priorities and prefix limits are set for each session, so that upstream providers are limited to advertising the correct net blocks. In the medium term, we are migrating to a new provider with better operational practices, better networking capacity and an overall more modern infrastructure and architecture. Although this does not address a direct root cause, it will reduce the likelihood of such a problem happening again.

2. "No data received" alerts sent to large numbers of customers. We have a protection mechanism in place to prevent mass "no data received" alerts from being sent when such global events happen. This mechanism failed as a consequence of a bug introduced in a recent release. A fix for this bug has been deployed, and our test suite has been extended to cover this scenario to prevent future regressions.
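To illustrate the kind of protection mechanism described in the second facet, here is a minimal sketch in Python. It is not our production code. It assumes the guard compares the number of postbacks received in a recent window against an expected baseline, and the names `IntakeWindow`, `should_suppress_nodata`, `devices_to_alert` and the `min_ratio` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class IntakeWindow:
    """Postback counts for one evaluation window (hypothetical structure)."""
    received: int  # payloads actually received in the window
    expected: int  # payloads normally expected in the window


def should_suppress_nodata(window: IntakeWindow, min_ratio: float = 0.5) -> bool:
    """Return True when intake has dropped far enough to suggest a global fault.

    min_ratio is an assumed threshold, not a documented value.
    """
    if window.expected <= 0:
        # No baseline to compare against, so the guard does not engage.
        return False
    return window.received / window.expected < min_ratio


def devices_to_alert(
    silent_devices: Iterable[str],
    window: IntakeWindow,
    min_ratio: float = 0.5,
) -> List[str]:
    """Filter the devices that should receive a "no data received" alert."""
    if should_suppress_nodata(window, min_ratio):
        # Intake is degraded globally: hold all nodata alerts.
        return []
    return list(silent_devices)
```

With intake near its baseline, silent devices are alerted as usual; when overall intake collapses, as it did during this incident, the sketch holds every "no data received" alert instead of notifying all affected customers.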

Posted May 02, 2017 - 17:46 BST

Resolved

We have not seen a recurrence of this issue.
We have also deployed a code fix to prevent false nodata alerts when the number of received postbacks drops below a certain threshold.

We'll be releasing a postmortem about this incident in the next few days.

Posted Apr 21, 2017 - 20:43 BST

Monitoring

Response times to our postbacks endpoint have normalized. We will continue to monitor this and confirm with our provider whether this was a recurrence. If you got a false nodata alert between 00:00 and 05:00 UTC today, please report it to hello@serverdensity.com.

Posted Apr 21, 2017 - 07:11 BST

Identified

In the last few hours we have seen elevated response times to our postback endpoint again. We're actively investigating the cause to determine whether this is the same issue or a new one. If you got a false nodata alert, please report it to hello@serverdensity.com.

Posted Apr 21, 2017 - 06:24 BST

Update

We're keeping this open for a few more hours because we observed some transient timeouts this morning.

Posted Apr 20, 2017 - 14:04 BST

Monitoring

Our provider reported: "An upstream transit provider incorrectly advertised a route that caused customer traffic to be incorrectly sent to that provider which would have caused customers to be unable to reach multiple datacenters for the duration of the event. Routing was corrected by the upstream transit provider at approximately 11:40 UTC and services should have begun to stabilize at that time."

We have confirmed expected monitoring values over the last 40 minutes and re-enabled nodata alerting at 14:30 UTC.

Over the next few hours we will continue to monitor networking parameters for recurrences.

Posted Apr 19, 2017 - 15:37 BST

Update

Our provider has informed us that a routing anomaly is causing the higher than normal latency and timeouts we've been seeing, and that they are working to identify its source. We will continue to keep nodata alerts disabled.

Posted Apr 19, 2017 - 14:19 BST

Update

Our provider has identified the ongoing problem and is working to restore full service. We are still seeing some network degradation and are keeping nodata alerts disabled to prevent further false nodata triggers. Alerting delays, if any, are residual now.

Posted Apr 19, 2017 - 13:45 BST

Update

We are continuing to work with our provider on this issue. We are seeing a reduced error rate but haven't received confirmation of a fix yet. We're currently keeping nodata alerts disabled to prevent further false nodata triggers.

Posted Apr 19, 2017 - 13:07 BST

Identified

We have identified network degradation on our public Internet uplinks. We are reaching out to our provider about this.
This is causing gaps in graphs, delayed alerting triggers and false nodata alerts on some devices.

Posted Apr 19, 2017 - 12:30 BST

Investigating

We're currently investigating a high error rate on postback intake.

Posted Apr 19, 2017 - 12:10 BST