On six occasions between 16 September 12:00 UTC and 21 September 20:00 UTC, we experienced extended packet loss and a reduction in server monitoring payload processing capacity. "no data received" alerts were disabled during these periods to avoid false positives, and gaps may have appeared in graphs due to dropped packets:
The first incident started at 12:00 UTC on 16 September, when our metrics indicated a failed node or a lack of capacity within the processing cluster. This manifested as incorrectly replicated partitions within one of our Zookeeper clusters. The normal remediation for this is a rolling reboot of the cluster, which we initiated. One of the nodes failed to reboot and we proceeded to recover it. Once it came back online, the incident appeared to resolve itself, so we closed it out.
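For illustration only, a minimal sketch of the kind of rolling reboot described above, assuming hypothetical host names, SSH access, and ZooKeeper's standard `ruok` health probe; our actual tooling is internal:

```python
import subprocess
import time

# Hypothetical node names; the real cluster layout is internal.
NODES = ["zk-node-1", "zk-node-2", "zk-node-3"]

def is_healthy(node: str) -> bool:
    """A node is healthy if ZooKeeper answers the 'ruok' probe with 'imok'."""
    result = subprocess.run(
        ["ssh", node, "echo ruok | nc localhost 2181"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip() == "imok"

def rolling_reboot(nodes, wait=30, retries=20):
    """Reboot one node at a time, waiting for it to rejoin before moving on."""
    for node in nodes:
        subprocess.run(["ssh", node, "sudo reboot"], check=False)
        for _ in range(retries):
            time.sleep(wait)
            try:
                if is_healthy(node):
                    break
            except subprocess.TimeoutExpired:
                continue
        else:
            raise RuntimeError(f"{node} did not come back; manual recovery needed")

if __name__ == "__main__":
    rolling_reboot(NODES)
```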
The second incident started at the same time the following day. It manifested as degraded network capacity between two nodes. We have seen similar incidents in the past, usually caused by noisy neighbours on the virtualised instances hosted by our cloud provider. We migrated the guests to other hosts and, after that was completed, the incident appeared to resolve itself, as on the previous day.
The two incidents on 18 and 19 September were very sporadic and only showed symptoms of a small drop in processed payloads. This triggered a failsafe mechanism that disables "no data received" alerts, at different intervals during the period. The short duration of the symptoms and the very low impact meant we were unable to find evidence of what was happening before the incident self-resolved, as on the previous days.
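The failsafe itself lives in our alerting pipeline; a minimal sketch of the idea, with hypothetical threshold values, is:

```python
# Hypothetical values; the real thresholds are tuned per cluster.
EXPECTED_PAYLOADS_PER_MIN = 10_000
MUTE_THRESHOLD = 0.90  # mute alerts below 90% of normal throughput

def should_mute_no_data_alerts(processed_last_minute: int) -> bool:
    """Mute "no data received" alerts when cluster-wide throughput drops,
    so customers are not paged for gaps caused by our own packet loss."""
    return processed_last_minute < EXPECTED_PAYLOADS_PER_MIN * MUTE_THRESHOLD
```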
On 20 September the incident occurred again, but this time we were ready to collect the data we needed to pinpoint the problem, which we identified as reduced internal network capacity at our hosting provider. Within an hour of the incident starting, we opened a ticket with them and requested an expedited escalation through our account manager. Unfortunately, the provider's internal processes meant that the ticket was not properly escalated until it had been investigated by their multiple levels of support. By the time it reached the networking team, the incident had subsided again.
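The evidence came from probes run between our own nodes; a minimal sketch of that kind of check, assuming hypothetical internal addresses and a Linux `ping`, is:

```python
import re
import subprocess

# Hypothetical internal addresses; the real probes ran between processing nodes.
TARGETS = ["10.0.0.11", "10.0.0.12"]

def probe(host: str, count: int = 50):
    """Return (packet loss %, average RTT in ms) from a burst of ICMP pings."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", host],
        capture_output=True, text=True,
    ).stdout
    loss = float(re.search(r"([\d.]+)% packet loss", out).group(1))
    rtt_match = re.search(r"= [\d.]+/([\d.]+)/", out)
    rtt = float(rtt_match.group(1)) if rtt_match else float("inf")
    return loss, rtt

if __name__ == "__main__":
    for host in TARGETS:
        loss, rtt = probe(host)
        print(f"{host}: {loss:.1f}% loss, {rtt:.1f} ms avg RTT")
```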
On 21 September the incident started at the same time as before, and we highlighted this to our provider, who again began their investigation. In the meantime, we deployed an emergency code change to mitigate the packet loss and high latency, thereby reducing the impact on customers.
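The change itself is internal; one common way to mitigate packet loss and latency at the application level is to retry with timeouts and backoff, sketched below with a hypothetical `send` callable that raises on failure:

```python
import random
import time

def send_with_retries(send, payload, attempts: int = 5, base_timeout: float = 1.0):
    """Retry sends with per-attempt timeouts and jittered exponential backoff,
    so transient packet loss delays a payload instead of dropping it."""
    for attempt in range(attempts):
        try:
            return send(payload, timeout=base_timeout * 2 ** attempt)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise
            # Jitter avoids synchronised retries across nodes.
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
```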
A short time after Softlayer began their investigation, they determined that it was a major networking incident affecting multiple customers and started their incident response process. They applied mitigations which resolved the impact around 20:00 UTC.
We have been waiting for a full incident report from Softlayer for the last two weeks, but they are still conducting their investigation. Initial discussions with their networking team indicate that their network was at capacity during these times due to activity from another customer, and that their monitoring failed to provide sufficient detail for them to either detect it themselves or provide a fast diagnosis when we raised the issue with them. We have decided to publish this report to provide a timeline of the incident and will update it once we have a root cause analysis from Softlayer.