Starting earlier today (exact time to be confirmed), notifications sent to PagerDuty via our integration with their API were not successfully delivered. All other notification types were unaffected.
We have a range of mechanisms in place to ensure alerts are delivered correctly. During this time, however, every call to the PagerDuty API succeeded and returned a valid API response with an incident ID, so our own monitoring showed everything working correctly.
After we manually noticed the missing alerts at 18:35 UTC, we notified PagerDuty. They diagnosed the problem and pushed out a fix at around 21:30 UTC.
We put a lot of time into ensuring alert delivery, and this incident has highlighted that we need further checks against 3rd party integrations: when an integration says it has received the alert data, we need to verify that the next expected action actually happens. In PagerDuty's case, we will be implementing regular, automated checks that send test events to their API and then confirm those events are actually created as incidents within PagerDuty.
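As a rough illustration of the kind of end-to-end check described above, here is a minimal Python sketch. It is not our production code and it does not call the real PagerDuty API: `send_event` and `incident_exists` are hypothetical callables standing in for "submit a test event" and "look up whether an incident carrying this marker exists", and the function name, timeout and polling interval are assumptions chosen for the example.

```python
import time
import uuid
from typing import Callable


def run_synthetic_check(
    send_event: Callable[[str], None],
    incident_exists: Callable[[str], bool],
    timeout_s: float = 60.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Send a uniquely marked test event, then confirm an incident appears.

    send_event and incident_exists are hypothetical wrappers around the
    third party's API; they are placeholders, not real library calls.
    Returns True if the incident is seen before the timeout, else False.
    """
    # A unique marker lets us find exactly the incident this check created.
    marker = f"synthetic-check-{uuid.uuid4()}"

    # Step 1: submit the test event. A successful response here only shows
    # that the event was accepted, not that an incident was created.
    send_event(marker)

    # Step 2: poll until the incident shows up or the deadline passes.
    deadline = clock() + timeout_s
    while True:
        if incident_exists(marker):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The point of the design is that acceptance and delivery are verified separately. A `False` result would mean the provider accepted the event but the expected incident never appeared, which is the failure mode our existing monitoring could not see. A real check would also need to resolve or otherwise clean up the test incident afterwards.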
We are waiting for a root cause analysis and time period for the incident from PagerDuty and will post further information once we have it.
Sep 11, 23:20 BST