We now consider this incident resolved. Our networking provider, Softlayer, has released the following details (WDC04 is our second data center, where the affected cluster is located):
"Starting at around 16-Mar-2017 20:17 UTC the Network Operations Center was notified that some customers were experiencing loss of ARP to their gateway in the WDC04 datacenter due to a high rate of anomalous traffic seen originating from a customer environment. As the issue was identified, preventative action was taken by network engineers to mitigate the traffic. During the initial 11 minute period, customer servers in WDC04 may have seen higher than normal latency or packet loss.
A second occurrence of the event was seen at 17-Mar-2017 15:35 UTC. Network Engineers mitigated the second occurrence at 16:08 UTC."
When asked about the work done on this, Softlayer released the following additional information:
"At this point this issue has had several eyes on it from the networking team and they are fully aware of the issues. We take issues like this very seriously and our networking team is doing everything they can on their end to keep issues like this to an absolute minimum. We learn from every mistake and try to continually improve our methods for detecting and tracking these issues.
Although we can not guarantee that this will never happen again, our networking team is doing everything they can on their end to keep things like this from re-occurring. Unfortunately there are limits to what we can release in regards to our mitigation strategies for security reasons; rest assured that we are trying our best to keep issues like this to an absolute minimum."
Mar 18, 22:32 GMT
We have confirmed a network failure on one of our payload processing clusters starting at 15:20. Network connectivity is now restored and we have brought this cluster back into service. For the next few hours, we will be monitoring this issue closely for recurrences and working with our provider to understand the root cause.
Mar 17, 16:46 GMT
We have identified a networking failure in one of our payload processing clusters. We have now disabled it pending further information from our infrastructure provider.
This will show as some gaps in graphs between 15:23 UTC and 15:56 UTC.
Alerting is not affected.
Mar 17, 16:13 GMT
We are seeing a payload processing slowdown on one of our clusters and are investigating.
Mar 17, 15:45 GMT