Service monitoring delays

Incident Report for Server Density

Postmortem

On March 18 between 15:41 and 17:21 UTC most Server Density service web checks failed to be scheduled to run.

This was caused by a code change in how the web checks component retrieves and dispatches service web checks to our worldwide testing locations. This component had a built in caching mechanism to allow faster retrieval of the tests to be executed and it was this mechanism that failed during this incident.

Following this incident, we have improved the response times and throughput capacity of the other Server Density components, that were the initial cause for that caching mechanism to be in place. This has allowed us to simplify the component by removing that caching mechanism altogether, making sure this issue would not happen again on account of a failed cache.

Posted Apr 26, 2018 - 10:28 BST

Resolved

We consider this now resolved.

Posted Apr 18, 2018 - 18:54 BST

Monitoring

We have pushed a correction and are monitoring the results which, at this time, indicate this incident is solved.

Posted Apr 18, 2018 - 18:27 BST

Identified

We are currently experiencing a delay in running service web checks.
This has been identified as a code problem and we are working to revert and normalize the service web checks execution.
This impacts alerts on service web checks only. Device alerting is not impacted.

Posted Apr 18, 2018 - 18:11 BST