We use CloudFlare as our primary DNS provider to take advantage of their global network of 100+ PoPs and the additional CDN and security functionality they offer.
CloudFlare uses a bespoke DNS server to power its anycast network. On 2016-09-14 at 07:00 UTC, a code path was unintentionally triggered, causing the DNS server software to crash on some nodes in six of their points of presence (PoPs): San Jose, CA; Los Angeles, CA; Chicago, IL; Hong Kong; Taipei, Taiwan; and Paris, France. The issue was escalated to their DNS engineers, who identified the bad code path and advised the on-call engineer on how to mitigate it.
This issue caused severe degradation, and in some cases total unavailability, of DNS resolution in the CloudFlare data centers listed above. At peak, CloudFlare was registering a total of 600k DNS resolution failures per second across the affected PoPs. Recursive resolution was not severely affected due to caching. Some HTTP endpoints that depend on DNS resolution were impacted in SJC.
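As an aside on the caching point: a recursive resolver keeps serving a cached answer while the authoritative servers are unreachable, until the record's TTL expires. The sketch below is a minimal illustration of that behavior using the dnspython package; the domain queried is only a placeholder and is not taken from this incident.

```python
# Minimal sketch (assumes the dnspython package is installed and that
# "example.com" stands in for a real record): a repeat query that comes
# back with a lower TTL was answered from the resolver's cache rather
# than from the authoritative servers.
import time
import dns.resolver

resolver = dns.resolver.Resolver()  # uses the system's recursive resolver

first = resolver.resolve("example.com", "A")
time.sleep(5)
second = resolver.resolve("example.com", "A")

print("first TTL :", first.rrset.ttl)
print("second TTL:", second.rrset.ttl)
if second.rrset.ttl < first.rrset.ttl:
    # The resolver decremented the TTL, i.e. it served the cached answer.
    print("answer served from resolver cache")
```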
CloudFlare SRE is in the process of patching the bad code path, ensuring that the DNS software will not crash again from this fault. In addition, SRE has informed the relevant teams on how to avoid triggering this code path, to prevent a recurrence of this issue going forward.
Server Density follow-up
As this was a failure at our DNS provider, and we already use a low TTL for failover purposes, there are no immediate remediation steps we can take at this time. We have considered using multiple DNS providers, but the complexity of keeping DNS zones properly in sync, together with the additional CDN and security functionality offered by CloudFlare, makes this option difficult.
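For reference, the low-TTL assumption can be checked with a short script. The sketch below uses the dnspython package; the hostname and the 300-second threshold are illustrative assumptions, not values taken from our configuration. It queries one of the zone's authoritative nameservers directly so that the configured TTL is reported rather than a resolver's remaining cached TTL.

```python
# Minimal sketch (assumes dnspython; hostname and threshold are
# hypothetical): verify that a record we rely on for failover still
# carries a low TTL at the authoritative nameservers.
import dns.resolver

ZONE = "serverdensity.io"               # zone to inspect
HOST = "example.serverdensity.io"       # hypothetical record name
MAX_TTL = 300                           # assumed failover budget, in seconds

# Look up the zone's authoritative nameservers, then query one directly.
ns_answer = dns.resolver.resolve(ZONE, "NS")
ns_host = str(ns_answer[0].target)
ns_ip = str(dns.resolver.resolve(ns_host, "A")[0])

auth = dns.resolver.Resolver(configure=False)
auth.nameservers = [ns_ip]

answer = auth.resolve(HOST, "A")
ttl = answer.rrset.ttl
status = "OK" if ttl <= MAX_TTL else "too high for fast failover"
print(f"{HOST}: TTL={ttl}s ({status})")
```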