On April 19, between 11:00 and 11:17 UTC, a routing issue affecting public network connectivity at our primary data center prevented us from receiving server monitoring agent payloads, which in turn triggered "no data received" alerts for affected customers.
Our initial response at 11:11 UTC was to manually disable "no data received" alerting for all customers. We then escalated the issue to our networking provider, who informed us of known networking problems. We continued to observe networking failures for several hours after our provider marked the routing issue as resolved, and re-enabled "no data received" alerting at 14:29 UTC.
This incident had two facets:

1. Public internet connectivity failure. Server monitoring agent payloads were unable to reach our agent endpoints over the public internet. Public network connectivity became degraded at approximately 19-Apr-2017 11:00 UTC, when an upstream provider began advertising additional routes to the network that are not normally learned from them. Due to the routing policy, these routes were preferred over the existing ones, which shifted traffic to this provider. The sudden increase in traffic congested the links to them. At approximately 19-Apr-2017 11:40 UTC the provider corrected their announcements, which normalized network connectivity. The networking provider will audit all peering links to ensure that routing priorities and prefix limits are set for each session, so that upstream providers can advertise only the correct net blocks. In the medium term, we are migrating to a new provider with a more modern infrastructure and architecture, better operational practices and better networking capacity. Although the existing infrastructure was not a direct root cause, this migration will reduce the likelihood of such a problem happening again.

2. "No data received" alerts sent to large numbers of customers. We have a protection mechanism in place to prevent mass "no data received" alerts from being sent when such global events happen. This mechanism failed because of a bug introduced in a recent release. A fix for this bug has been deployed, and our test suite has been extended to cover this scenario to prevent future regressions.
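
For readers unfamiliar with this kind of safeguard, the following is a minimal sketch of how a mass-alert protection mechanism might work in general. It is not our production code: the function name `devices_to_alert`, the 300-second timeout and the 20% threshold are all hypothetical values chosen for illustration. The idea is that when an unusually large share of devices goes silent at the same moment, the cause is more likely a global event than many individual failures, so alerts are held back.

```python
from typing import Dict, List


def devices_to_alert(last_seen: Dict[str, float],
                     now: float,
                     timeout: float = 300.0,             # hypothetical value
                     max_silent_fraction: float = 0.2    # hypothetical value
                     ) -> List[str]:
    """Return the device IDs that should get a "no data received" alert.

    last_seen maps a device ID to the timestamp (in seconds) of its most
    recent payload. If more than max_silent_fraction of all devices are
    silent at once, assume a global event and suppress every alert.
    """
    if not last_seen:
        return []

    # A device is silent when its last payload is older than the timeout.
    silent = sorted(d for d, ts in last_seen.items() if now - ts > timeout)

    # Global-event guard: too many silent devices at once means no alerts.
    if len(silent) / len(last_seen) > max_silent_fraction:
        return []

    return silent
```

With ten devices of which one has been silent for longer than the timeout, the function returns that single device. If all ten are silent at once, it returns an empty list. A regression test for the second case is the kind of coverage described above.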