Apr 29, 12:14 BST
We have not seen further occurrences of this issue. Our provider has declared it fixed and released a Reason for Outage, which we'll be reviewing and publishing as part of the full post-mortem.
Apr 19, 12:50 BST
The original cause of this incident was resolved this morning at 04:00 UTC when SoftLayer networking engineers updated the firmware on the affected switches. However, at 06:16 UTC, work being performed on a redundant backbone router in the LAX01 PoP caused a high rate of internal routing updates to the US-based route reflectors. As a result, the route reflectors experienced instability in the routing process, which ultimately caused that software component to crash.
Regional traffic within Europe, Asia, and India would, for the most part, not have been affected by this incident. Likewise, traffic within a given city in the US datacenters would have remained unaffected, and most Internet ingress/egress to datacenters in North and South America also continued without notable disruption.
The primary impact was traffic between datacenters, where at least one of the datacenters was located in North or South America. Both public and private network traffic within those regions would have been similarly impacted.
As network engineers normalized the LAX01 router, the high rate of routing update messages ceased, and the route reflectors stabilized without any further manual intervention. Once the route reflectors stabilized, normal routing resumed shortly thereafter.
Since we deploy across multiple datacenters in North America (Washington, San Jose and Toronto), our public and private networking traffic was affected. This caused service unavailability between 06:16 UTC and 06:52 UTC.
We will be working with SoftLayer to understand the root cause of both of these incidents and will publish a post-mortem within the next few days.
Apr 13, 08:15 BST
We will continue to monitor this issue and work with SoftLayer to get confirmation that networking capability at our datacenters is fully restored.
Apr 13, 08:01 BST
At this time we have a full recovery and will also be re-enabling device nodata alerting. Between 06:30 and 06:35, before we disabled this alerting, some nodata alerts may have been sent for devices for which we had not yet restored payload processing. These would have been cleared before 06:40.
Apr 13, 07:56 BST
We've just received confirmation that the SoftLayer network in the US began experiencing routing instability earlier. The primary impact to us is that traffic into and out of our datacenters has experienced a disruption.
At this time we have been able to recover some service. Alerting is restored except for device nodata alerts, which we are keeping off until we are back to normal payload processing.
Apr 13, 07:37 BST
At this time our payload processing has stalled again. We are investigating if this occurrence is a result of our provider network recovery attempts.
Apr 13, 07:16 BST
While network engineers in one of our SoftLayer datacenters were performing emergency maintenance to correct a switching issue between the two chassis, an errant command was applied. The errant command was corrected shortly afterwards, at which time normal traffic flow was restored.
This command caused the router to switch some traffic through a sub-optimal path for a portion of the aggregate switches downstream of this router. As a result, some of our backend servers experienced increased latency on the private network.
About an hour after this, the increased latency recurred due to a bug in the switch firmware, which again raised latency on the backend network and slowed down our alert processing. Network engineers responded rapidly, manually clearing the condition and resuming normal traffic forwarding.
After the first occurrence, the bug was triggered a further three times, but network engineers were able to clear it within a minute each time.
The hardware vendor has identified the bug that is being triggered and has recommended a course of action. Network management is reviewing the recommendation to determine the best implementation plan.
Network engineers will continue to monitor closely to ensure rapid remediation for future recurrences. We will provide another update as we have more information.
Apr 12, 08:22 BST
The root cause of this incident has been identified as increased network latency on switches that power our internal networking. Our provider is working to restore normal service. Until that happens we are expecting further occurrences of this issue and will post updates as they occur.
Apr 12, 00:19 BST
We have not yet concluded work on this issue, but all systems have been back to normal for the past hour. We'll continue to monitor overnight.
Apr 11, 23:23 BST
Payload processing has been back to normal for the past 30 minutes. We'll continue to investigate the issue.
Apr 11, 22:46 BST
We are experiencing a delay in payload processing and we are investigating the cause. This will result in alerting delays.
Apr 11, 22:18 BST