Low number of payloads processed - nodata alerts are delayed
Incident Report for Server Density
Postmortem

On 6 occasions beginning 16 September 12:00 UTC and ending 21 September 20:00 UTC, we experienced extended packet loss and a reduction in server monitoring payload processing capacity. "No data received" alerts were disabled during these periods to avoid false positives, and gaps may have appeared in graphs due to dropped packets:

  • September 16 between 12:00 and 20:00 UTC
  • September 17 between 12:00 and 18:00 UTC
  • September 18 between 13:00 and 18:00 UTC
  • September 19 between 12:00 and 18:00 UTC
  • September 20 between 11:00 and 20:00 UTC
  • September 21 between 11:00 and 20:00 UTC

The first incident started at 12:00 UTC on 16 September, when our metrics indicated a failed node or a lack of capacity within the processing cluster. This manifested as incorrectly replicated partitions within one of our Zookeeper clusters. The normal remediation for this is a rolling reboot of the cluster, which we initiated. One of the nodes failed to reboot and we proceeded to recover it. Once it came back online, the incident appeared to resolve itself, so we closed out the incident.

The second incident started at the same time the following day. It manifested as degraded network capacity between two nodes. We have seen similar incidents in the past, usually caused by noisy neighbours on the virtualised instances hosted by our cloud provider. We migrated the guests to other hosts and, after that was completed, the incident appeared to resolve itself as with the previous day.

The two incidents on September 18 and 19 were very sporadic and only showed symptoms of a small drop in processed payloads. This triggered a failsafe mechanism that disabled "no data received" alerts at different intervals during the period. The short duration of the symptoms and the very low impact meant we were unable to find evidence of what was happening before the incident resolved itself, as on the previous days.
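
To illustrate how this failsafe behaves, the sketch below shows the kind of threshold check involved, using the 2% drop threshold and the 5 and 15 minute waits described in the updates further down. The names and structure are illustrative only, not our production code:

    # Illustrative sketch of the nodata-alert failsafe, not production code.
    # When the cluster-wide drop in processed payloads crosses a threshold,
    # the wait before a "no data received" alert fires is extended so that
    # platform-side packet loss does not generate false alerts.

    DEFAULT_WAIT_MINUTES = 5    # built-in wait before a nodata alert fires
    EXTENDED_WAIT_MINUTES = 15  # wait applied while the failsafe is active
    DROP_THRESHOLD = 0.02       # 2% drop in processed payloads

    def nodata_wait_minutes(expected_payloads: int, processed_payloads: int) -> int:
        """Return how long to wait before firing a nodata alert for a device."""
        if expected_payloads <= 0:
            return DEFAULT_WAIT_MINUTES
        drop = 1 - (processed_payloads / expected_payloads)
        if drop >= DROP_THRESHOLD:
            # A cluster-wide shortfall suggests a platform problem rather than
            # silent devices, so hold the alert back.
            return EXTENDED_WAIT_MINUTES
        return DEFAULT_WAIT_MINUTES

    # Example: a 2.5% drop extends the wait from 5 to 15 minutes.
    print(nodata_wait_minutes(expected_payloads=100000, processed_payloads=97500))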

On September 20, the incident occurred again, but we were ready to collect the data we needed to pinpoint the problem, which we identified as reduced internal networking capacity at our hosting vendor. Within an hour of the incident starting, we opened a ticket with them and requested an expedited escalation through our account manager. Unfortunately, the internal processes of the provider meant that the ticket was not properly escalated until it had been investigated by their multiple levels of support. By the time it reached the networking team, the incident had subsided again.

On September 21, the incident started at the same time as before and we highlighted this to our provider, who again began their investigation. In the meantime, we deployed an emergency code change to mitigate the packet loss and high latency, reducing the impact on customers.
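
The update timeline below does not detail that change; as a rough illustration only of this type of mitigation (not the code we actually shipped), retrying an internal payload forward with exponential backoff and a generous timeout tolerates transient packet loss and high latency:

    # Illustrative only: one common way to tolerate packet loss and high
    # latency is to retry an internal payload forward with increasing backoff.
    import time

    import requests  # assumed HTTP transport; the real transport may differ

    def forward_payload(url: str, payload: dict, attempts: int = 5) -> bool:
        """Try to deliver a payload, backing off between failed attempts."""
        delay = 0.5  # seconds
        for attempt in range(attempts):
            try:
                response = requests.post(url, json=payload, timeout=10)
                if response.ok:
                    return True
            except requests.RequestException:
                pass  # dropped packets / timeouts fall through to a retry
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= 2  # back off so retries do not add to congestion
        return False  # delivery failed; the device will show a gap on its graph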

A short time after Softlayer began their investigation, they determined that it was a major networking incident affecting multiple customers and started their incident response process. They applied mitigations which resolved the impact at around 20:00 UTC.

We have been waiting for a full incident report from Softlayer for the last 2 weeks, but they are still conducting their investigation. Initial discussions with their networking team indicate that their network was at capacity during these times due to activity from another customer, and that their monitoring failed to provide sufficient detail for them to either detect it themselves or provide a fast diagnosis when we raised the issue with them. We have decided to publish this report to provide a timeline of the incident and will update it once we have a root cause analysis from Softlayer.

Posted Oct 03, 2016 - 21:34 BST

Resolved
This incident has been resolved.
Posted Oct 03, 2016 - 21:33 BST
Update
Our provider has completed emergency maintenance on a router identified as showing backplane congestion throughout this incident and affecting our network traffic.
We will monitor this throughout tomorrow to confirm the fix.
Posted Sep 21, 2016 - 22:17 BST
Monitoring
Payload intake and processing has just normalized. We still don't have a resolution statement from our provider, so we will continue to monitor this issue for another 24 hours.
Posted Sep 21, 2016 - 21:19 BST
Identified
Our provider has just notified us that they have identified the issue and are currently working on a correction.
Posted Sep 21, 2016 - 19:30 BST
Update
Our provider continues to work on the issue. We have deployed a code change to minimize the impact on device payload processing, which will reduce gaps on graphs for the small number of devices affected.
Posted Sep 21, 2016 - 18:20 BST
Update
We're continuing to work with our provider to resolve this. We'll share more significant updates as and when we have them.
Posted Sep 21, 2016 - 17:09 BST
Update
We're continuing to work with our provider to resolve this.
Posted Sep 21, 2016 - 15:23 BST
Investigating
We're observing further problems with processing agent postbacks and we're working with our provider on the network degradation which is causing this incident.
Posted Sep 21, 2016 - 14:40 BST
Monitoring
As during the previous occurrences of this issue, payload intake has normalized and nodata alerts have been re-enabled.
Our provider has not yet found the cause of this problem and will continue to work on it. We'll continue to monitor postback intake and processing closely.
Posted Sep 20, 2016 - 21:05 BST
Update
Our provider's networking team is still looking into the cause of the degradation we're observing.
Posted Sep 20, 2016 - 20:02 BST
Update
We are continuing to work with our provider to track down the network performance degradation we are observing.
Posted Sep 20, 2016 - 18:00 BST
Investigating
We are again observing a lower than normal number of received device payloads, and nodata alerts are disabled at the moment. The affected devices will show gaps on their graphs.
Our monitoring has also picked up degradation on the internal networking, and we are currently working with our provider to find the cause and fix it.
Posted Sep 20, 2016 - 16:26 BST
Update
This issue has just cleared and nodata alerts are active again. We will continue to look into the root cause of this issue for another 24 hours.
Posted Sep 19, 2016 - 18:41 BST
Update
We are again observing a lower than normal number of received device payloads, and nodata alerts are disabled at the moment.
Posted Sep 19, 2016 - 17:45 BST
Update
We have been observing the expected 5-minute wait time on nodata alerts for the past 3 hours, which means that received payloads have normalized. We will keep monitoring this issue for the next 24 hours.
Posted Sep 18, 2016 - 17:10 BST
Monitoring
We are observing a lower than normal number of received device payloads. The drop has been fluctuating between 1.5% and 2.5%. Once the drop crosses 2%, our protection against false "no data" alerts triggers and delays delivery of those alerts, so instead of the built-in 5-minute wait we are seeing waits of up to 15 minutes.
We are continuing to monitor this issue and will update the status in about 3 hours. If it remains, we'll adjust our protection threshold.
Posted Sep 18, 2016 - 12:23 BST