Reduction in payload processing capacity
Incident Report for Server Density
Postmortem

On July 6th at 07:45 UTC we added capacity to mitigate a data request spike that impacted payload processing capacity. This may have appeared for some customer devices as graph gaps and delayed alerting.

The next day, shortly after 07:00 UTC, we saw the same event occur again. We applied the same mitigation and were able to limit the impact to a few minutes, between 07:00 and 07:11 UTC.

Over the following couple of days the event recurred, but keeping the mitigation in place eliminated the impact. We were then able to identify a set of abnormal API calls to https://api.serverdensity.io. These were legitimate customer requests adhering to the API specification, but their timing and size characteristics caused a knock-on impact on the payload processing engine.

Over the past few weeks we have worked with the requesting customers and reviewed our API implementation to prevent this issue from happening again.
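The exact safeguards are internal to our API, but the kind of change reviewed above is conceptually similar to the per-client request guard sketched below. This is only an illustrative outline, not our production code; the limits, names and the allow_request function are assumptions made for the example.

    # Hypothetical sketch of a per-client rate and size guard; all limits
    # and names below are illustrative assumptions, not actual API values.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60          # sliding window length
    MAX_REQUESTS = 120           # requests allowed per client per window
    MAX_BODY_BYTES = 1_000_000   # reject unusually large request bodies

    _recent = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow_request(client_id, body_size, now=None):
        """Return True if the request is within rate and size limits."""
        if body_size > MAX_BODY_BYTES:
            return False
        now = time.monotonic() if now is None else now
        window = _recent[client_id]
        # Drop timestamps that have fallen outside the sliding window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_REQUESTS:
            return False
        window.append(now)
        return True

A guard like this limits how much a single client's burst of large requests can affect shared processing capacity, which is the failure mode described in this incident.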

Posted Aug 08, 2017 - 14:00 BST

Resolved
This incident has been resolved.
Posted Aug 08, 2017 - 13:58 BST
Monitoring
Today we have confirmed the source of the request spike causing this incident.
After today we don't expect a recurrence, and we have moved the incident state to "Monitoring" while we work with the source of these requests to remove its negative impact.
Posted Jul 11, 2017 - 08:19 BST
Update
During the last occurrences we narrowed down the cause of the request spike as coming from our API (api.serverdensity.io) rather than, for example, the user-facing app or incoming device payloads.
Today we were able to prevent the daily 07:00 UTC occurrence by blocking a set of suspect API calls. This has reduced the issue scope even further, putting us closer to a solution. Today's impact was a 4-minute unavailability (06:58 - 07:02 UTC) of that set of API calls.
Posted Jul 09, 2017 - 08:15 BST
Update
We have kept this incident open because the event only occurs at 07:00 UTC, which prevents us from continuously verifying possible corrections. We are continuing to work on it.
We'll update this again tomorrow after 07:00 UTC.
Posted Jul 08, 2017 - 08:37 BST
Update
Between 07:00 and 07:11 UTC we had a recurrence of this incident. The impact was immediately mitigated, but we are still investigating the root cause of this data request spike.
Posted Jul 07, 2017 - 08:24 BST
Update
Payload processing has been normal since 08:15 UTC. We're continuing to work on the cause of the observed request spike.
Posted Jul 06, 2017 - 12:02 BST
Identified
We have identified a reduction in our device payload processing capacity caused by an abnormal data request. This may show on some devices as missing metrics data. Alerting is not affected.
We've adjusted capacity while we identify and resolve the request spike.
Posted Jul 06, 2017 - 08:49 BST