Route leak
Incident Report for Server Density
Postmortem

Server Density uses CloudFlare in front of all our web traffic to provide performance enhancements, a CDN and security functionality. Worldwide traffic is directed to the closest CloudFlare data centre, which then proxies the traffic to our own infrastructure.

At 15:29 UTC a large Qatar-based ISP misconfigured their BGP announcements, resulting in a route leak. The Internet is built on BGP trust between networks (identified by AS numbers), so if an ISP incorrectly announces an IP address (or, in this case, a number of prefixes), upstream networks and peers can incorrectly route packets. This change propagated to large carriers such as NTT, TeliaSonera and Level 3, meaning the impact was much larger and not isolated to Doha, Qatar.
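
To illustrate the mechanism, here is a minimal sketch of origin-based route leak detection of the kind services such as BGPmon perform: announcements observed at route collectors are compared against the origin AS expected for each prefix. All prefixes, AS numbers and paths below are invented for the example.

```python
# Minimal sketch of origin-AS based route leak detection. All prefixes,
# AS numbers and AS paths are invented for illustration only.

# The origin AS expected to announce each prefix.
EXPECTED_ORIGINS = {
    "198.51.100.0/24": 64496,
    "203.0.113.0/24": 64496,
}

# BGP updates as seen by route collectors: (prefix, origin AS, AS path).
observed_updates = [
    ("198.51.100.0/24", 64496, [64510, 64496]),   # normal announcement
    ("203.0.113.0/24", 64500, [64511, 64500]),    # leaked: unexpected origin AS
]

def find_leaks(updates, expected):
    """Return updates whose origin AS does not match the expected origin."""
    return [
        (prefix, origin, path)
        for prefix, origin, path in updates
        if prefix in expected and origin != expected[prefix]
    ]

for prefix, origin, path in find_leaks(observed_updates, EXPECTED_ORIGINS):
    print(f"Possible route leak: {prefix} originated by AS{origin}, path {path}")
```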

BGPmon (used to assess routing health) notified CloudFlare of the route leak at 15:41 UTC, and at 15:42 UTC the Doha POP was removed from production, with requests served via other locations instead. Network engineers reached out to inform the ISP of their misconfiguration, and the changes were reverted at 16:04 UTC. Routing was back to normal at 16:08 UTC.
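
Independently of the provider, one way to confirm that routing has settled is to query a public routing data service such as the RIPEstat data API for the affected prefixes and check the currently visible origin AS. The sketch below uses a documentation prefix as a placeholder and reads the response fields defensively, since the exact schema may differ.

```python
# Rough sketch: look up the currently visible origin AS for a prefix via the
# public RIPEstat "routing-status" data API. The prefix is a documentation
# placeholder; response fields are read defensively as the schema may vary.
import json
import urllib.request

PREFIX = "198.51.100.0/24"  # placeholder, not a real production prefix
URL = f"https://stat.ripe.net/data/routing-status/data.json?resource={PREFIX}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    payload = json.load(resp)

data = payload.get("data", {})
last_seen = data.get("last_seen") or {}
print("Prefix:", PREFIX)
print("Last seen origin AS:", last_seen.get("origin"))
print("Visibility:", data.get("visibility"))
```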

An external timeline and visualisation of this route leak is available at https://bgpstream.com/event/2424, which details the changes.

Between 15:29 UTC and 16:05 UTC, a subset of the devices monitored by Server Density may have been unable to send postbacks to our global endpoints. According to our monitoring, this affected around 20% of customer devices.
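
For context on how a figure like "around 20%" is derived from monitoring data, the hypothetical sketch below counts a device as affected if none of its postbacks arrived during the incident window. The device records and timestamps are invented for illustration.

```python
# Hypothetical sketch: estimate the share of devices affected by counting
# devices that sent no postback during the incident window. All device
# records below are invented example data.
from datetime import datetime, timezone

WINDOW_START = datetime(2015, 10, 5, 15, 29, tzinfo=timezone.utc)
WINDOW_END = datetime(2015, 10, 5, 16, 5, tzinfo=timezone.utc)

# device id -> timestamps of postbacks received around the incident
postbacks = {
    "device-a": [datetime(2015, 10, 5, 15, 45, tzinfo=timezone.utc)],
    "device-b": [datetime(2015, 10, 5, 15, 20, tzinfo=timezone.utc)],  # before the window only
    "device-c": [datetime(2015, 10, 5, 15, 31, tzinfo=timezone.utc)],
    "device-d": [datetime(2015, 10, 5, 15, 58, tzinfo=timezone.utc)],
    "device-e": [datetime(2015, 10, 5, 16, 0, tzinfo=timezone.utc)],
}

def affected(timestamps):
    """True if no postback arrived inside the incident window."""
    return not any(WINDOW_START <= t <= WINDOW_END for t in timestamps)

hit = [device for device, ts in postbacks.items() if affected(ts)]
print(f"{len(hit)}/{len(postbacks)} devices affected "
      f"({100 * len(hit) / len(postbacks):.0f}%)")
```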

Unfortunately, route leaks are notoriously difficult to mitigate due to the fundamental design of the Internet's BGP architecture. Some of the systems we rely on are outside our control, but it is still ultimately our responsibility to ensure the availability of our services. We will be evaluating how we can minimise the impact of similar incidents in the future.

In the medium term, CloudFlare are evaluating IP renumbering for select prefixes and providing more proactive training for their peers, and in the longer term they are working on reducing the impact caused by route leaks.

Posted Oct 12, 2015 - 14:02 BST

Resolved
Our provider has declared the incident resolved and we have confirmed normal values from our monitoring for the past 3 hours.
Posted Oct 05, 2015 - 20:30 BST
Monitoring
Our provider has reported that a fix has been implemented, and we're seeing normal values on our monitoring as well. We are now reactivating nodata alerts and will continue to monitor the situation together with our provider.
A little under 20% of customer devices were affected by this incident, depending on the geographical location of the device.
Posted Oct 05, 2015 - 17:40 BST
Identified
Our upstream provider has confirmed the route leak and is working to resolve it:
https://www.cloudflarestatus.com/incidents/3zcnm4rnl0vv
We have deactivated nodata alerts for the duration of this incident to prevent false positives.
Posted Oct 05, 2015 - 17:17 BST
Investigating
We're currently seeing connectivity failures from our POPs in London, São Paulo, Johannesburg, Doha, Frankfurt and Berlin, caused by an upstream provider route leak. This may cause failed postbacks. We're investigating and will have more details shortly.
Posted Oct 05, 2015 - 16:57 BST