DNS issues
Incident Report for Server Density
Postmortem

We use CloudFlare for our primary DNS so that we can take advantage of their global network of 100+ points of presence (PoPs) and the additional CDN and security functionality they offer.

CloudFlare uses a bespoke DNS server to power its anycast network. On 2016-09-14 at 07:00 UTC, a code path was unintentionally triggered, which caused the DNS server software to crash on some nodes in six of their PoPs: San Jose, CA; Los Angeles, CA; Chicago, IL; Hong Kong; Taipei, Taiwan; and Paris, France. This was escalated to their DNS engineers, who identified the bad code path and advised the on-call engineer on how to mitigate it.

This issue caused severe degradation and, in some cases, total unavailability of DNS resolution in the CloudFlare data centers listed above. At peak, CloudFlare was registering a total of 600k DNS resolution failures per second across the affected PoPs. Recursive resolution was not severely affected, because recursive resolvers could keep serving cached answers until their TTLs expired. Some HTTP endpoints that depend on DNS resolution were impacted in SJC.
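As an aside, the caching effect is easy to observe: query a public recursive resolver for the same record twice and the TTL counts down rather than resetting, showing that the second answer was served from cache without reaching the authoritative servers. The resolver address and record name in this sketch are purely illustrative:

    # Hypothetical illustration: ask a public recursive resolver (8.8.8.8) for the
    # same name twice and compare TTLs. A decreasing TTL means the second answer
    # came from cache and never reached the authoritative servers.
    import time
    import dns.resolver  # pip install dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["8.8.8.8"]

    first = resolver.resolve("example.com", "A")
    time.sleep(5)
    second = resolver.resolve("example.com", "A")

    print("first TTL: ", first.rrset.ttl)
    print("second TTL:", second.rrset.ttl)  # lower than the first when served from cache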

  • 07:00 UTC, impact started and alerting triggered.
  • 07:15 UTC, SRE starts investigating.
  • 07:43 UTC, SRE adds host firewall rules to limit requests per second from specific clients (a sketch of this kind of rule follows the timeline).
  • 07:56 UTC, public status announced: https://www.cloudflarestatus.com/incidents/nmcvfb923w0b
  • 07:57 UTC, Singapore SRE shift escalates to UK SRE by phone. The UK SRE shift has already arrived at the office and begins investigating.
  • 08:02 UTC, SRE deploys additional firewall rules in an attempt to mitigate the impact.
  • 08:35 UTC, SRE drops DNS from LAX to test a hypothesis.
  • 09:15 UTC, SRE identifies the root cause of the issue (a code path was unintentionally triggered, which caused the DNS server software to crash on some nodes) and identifies a workaround: manually adjusting a database record. This workaround propagates to the edge within a few seconds, and recoveries are seen shortly after.
  • 09:16 UTC, SRE begins restarting DNS instances that are non-responsive.
  • 09:30 UTC, normal service restored.
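The report does not include the exact rules CloudFlare applied at 07:43 and 08:02. Purely as an illustration of the kind of mitigation described, a per-client rate limit on inbound DNS traffic can be applied at the host firewall with the iptables hashlimit match; every value below (rate, chain, rule name) is an assumption rather than CloudFlare's actual configuration:

    # Illustrative only: apply a host-level, per-source-IP rate limit to UDP/53
    # traffic using iptables hashlimit. The rate and rule name are hypothetical.
    import subprocess

    def limit_udp_dns_per_client(rate="1000/second"):
        """Drop UDP port 53 packets from any single source IP above `rate`."""
        rule = [
            "iptables", "-A", "INPUT",
            "-p", "udp", "--dport", "53",
            "-m", "hashlimit",
            "--hashlimit-name", "dns-per-client",
            "--hashlimit-mode", "srcip",
            "--hashlimit-above", rate,
            "-j", "DROP",
        ]
        subprocess.run(rule, check=True)

    if __name__ == "__main__":
        limit_udp_dns_per_client()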

Cloudflare SRE is in the process of patching the bad code path, ensuring that the DNS software will not crash again from this fault. In addition, SRE has informed the relevant teams on how to avoid triggering this code path, preventing a recurrence of this issue going forward.

Server Density follow-up

As this was a failure at our DNS provider and we already use a low TTL for failover purposes, there are no immediate remediation steps that we can take at this time. Using multiple DNS providers has been considered, but the complexity of keeping DNS zones properly in sync across providers, combined with the additional CDN and security functionality offered by Cloudflare, makes this option difficult.
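To give a sense of the zone-sync complexity referred to above: running two DNS providers in parallel means continuously verifying that both return identical answers for every record. A minimal sketch of such a consistency check is shown below; the nameserver IPs and record names are hypothetical placeholders, and this is not tooling we run:

    # Minimal sketch: compare the answers two providers' authoritative
    # nameservers return for the same records, flagging any drift between zones.
    import dns.exception
    import dns.resolver  # pip install dnspython

    PROVIDERS = {
        "provider_a": "192.0.2.53",     # placeholder nameserver IP
        "provider_b": "198.51.100.53",  # placeholder nameserver IP
    }
    RECORDS = [("www.example.com", "A"), ("example.com", "MX")]

    def answers(nameserver_ip, name, rtype):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver_ip]
        r.lifetime = 5
        try:
            return sorted(rr.to_text() for rr in r.resolve(name, rtype))
        except dns.exception.DNSException as exc:
            return ["<error: %s>" % exc]

    for name, rtype in RECORDS:
        results = {p: answers(ip, name, rtype) for p, ip in PROVIDERS.items()}
        if len({tuple(v) for v in results.values()}) > 1:
            print("MISMATCH %s %s: %s" % (name, rtype, results))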

Posted Oct 04, 2016 - 09:42 BST

Resolved
This has now been resolved. We will share the detailed cause as soon as our provider releases it.
Posted Sep 14, 2016 - 12:26 BST
Monitoring
Our provider has implemented a fix and we are monitoring the results.
Posted Sep 14, 2016 - 11:03 BST
Update
We have received an updated list of affected locations:
Europe (CDG - Paris, France); North America (LAX - Los Angeles, CA, United States; ORD - Chicago, IL, United States; SJC - San Jose, CA, United States); and Asia (HKG - Hong Kong; TPE - Taipei, Taiwan).
Posted Sep 14, 2016 - 09:40 BST
Identified
Our DNS provider, Cloudflare, is experiencing DNS resolution issues in Los Angeles, Chicago, San Jose, Hong Kong, Paris and Taipei.
We have also confirmed a drop in received payloads, and our "no data alert protection" (which delays delivery of "no data" alerts when our inbound payload volume drops by 2%) activated momentarily.
Posted Sep 14, 2016 - 09:15 BST
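For reference, the "no data alert protection" mentioned in the "Identified" update above works by holding back "no data" alerts while overall inbound payload volume is depressed, on the assumption that a sharp drop usually reflects an ingestion or delivery problem rather than many agents stopping at once. The sketch below is a rough illustration of that idea rather than our actual implementation; only the 2% threshold comes from the update, while the window size and comparison are assumptions:

    # Rough sketch (not Server Density's actual implementation) of delaying
    # "no data" alerts while inbound payload volume has dropped sharply.
    from collections import deque

    class NoDataAlertGuard:
        def __init__(self, drop_threshold=0.02, window_minutes=5):
            self.drop_threshold = drop_threshold        # 2% drop, per the update above
            self.counts = deque(maxlen=window_minutes)  # payloads received per minute

        def record_minute(self, payload_count):
            self.counts.append(payload_count)

        def should_delay_no_data_alerts(self):
            """True while the latest minute sits 2% or more below the recent peak."""
            if len(self.counts) < self.counts.maxlen:
                return False  # not enough history to judge a drop
            baseline = max(self.counts)
            latest = self.counts[-1]
            return baseline > 0 and (baseline - latest) / baseline >= self.drop_threshold

    # Example: volume falls from 100k to 97k payloads per minute (a 3% drop),
    # so "no data" alerts would be held back until volume recovers.
    guard = NoDataAlertGuard()
    for count in (100_000, 100_000, 99_500, 98_000, 97_000):
        guard.record_minute(count)
    print(guard.should_delay_no_data_alerts())  # True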