All Systems Operational
Alerting Operational
Alert Delivery Operational
SMS Operational
E-mail Operational
PagerDuty (Incident Creation) Operational
PagerDuty (Notification Delivery) Operational
Slack Operational
Webhooks Operational
HipChat Operational
Push notifications (global) Operational
Push notifications (iOS) Operational
Push notifications (Android) Operational
Agent payloads Operational
API Operational
Availability monitoring Operational
Web UI Operational
Past Incidents
Mar 23, 2017

No incidents reported today.

Mar 22, 2017

No incidents reported.

Mar 21, 2017

No incidents reported.

Mar 20, 2017

No incidents reported.

Mar 19, 2017

No incidents reported.

Mar 18, 2017
Resolved - We now consider this incident resolved. Our networking provider, SoftLayer, has released the following details (WDC04 is our second data center, where the affected cluster is located):

"Starting at around 16-Mar-2017 20:17 UTC the Network Operations Center was notified that some customers were experiencing loss of ARP to their gateway in the WDC04 datacenter due to a high rate of anomalous traffic seen originating from a customer environment. As the issue was identified, preventative action was taken by network engineers to mitigate the traffic. During the initial 11 minute period, customer servers in WDC04 may have seen higher than normal latency or packet loss.

A second occurrence of the event was seen at 17-Mar-2017 15:35 UTC. Network engineers mitigated the second occurrence at 16:08 UTC."

When asked about the work done on this, SoftLayer released the following additional information:
"At this point this issue has had several eyes on it from the networking team, and they are fully aware of the issues. We take issues like this very seriously, and our networking team is doing everything they can on their end to keep issues like this to an absolute minimum. We learn from every mistake and try to continually improve our methods for detecting and tracking these issues.

Although we cannot guarantee that this will never happen again, our networking team is doing everything they can on their end to keep things like this from recurring. Unfortunately, there are limits to what we can release regarding our mitigation strategies for security reasons; rest assured that we are trying our best to keep issues like this to an absolute minimum."
Mar 18, 22:32 GMT
Monitoring - We have confirmed a network failure on one of our payload processing clusters starting at 15:20 UTC. Network connectivity is now restored and we have brought this cluster back into service. For the next few hours, we will be monitoring closely for recurrences, as well as working with our provider to understand the root cause.
Mar 17, 16:46 GMT
Identified - We have identified a networking failure in one of our payload processing clusters. We have disabled it pending further information from our infrastructure provider.
This will appear as gaps in graphs between 15:23 UTC and 15:56 UTC.
Alerting is not affected.
Mar 17, 16:13 GMT
Investigating - We are seeing slow payload processing on one of our clusters and are investigating.
Mar 17, 15:45 GMT
Mar 16, 2017

No incidents reported.

Mar 15, 2017

No incidents reported.

Mar 14, 2017

No incidents reported.

Mar 13, 2017

No incidents reported.

Mar 12, 2017

No incidents reported.

Mar 11, 2017

No incidents reported.

Mar 10, 2017

No incidents reported.

Mar 9, 2017

No incidents reported.