Payload processing delay
Incident Report for Server Density
Postmortem

Between April 11th 18:47 UTC and April 13th 06:13 UTC we experienced two distinct incidents on our internal network.

The first incident, on April 11th, caused high latency between a significant number of our servers, which degraded our service, delaying our processing of payloads and, consequently, alerting. Our provider SoftLayer released the following reason for outage:

At 11-Apr-2016 18:47 UTC, the SoftLayer Network Operations Center began receiving reports of increased traffic latency to hosts behind bcr03.wdc01. 

Network Engineers began to investigate and discovered that some of the traffic egressing the router chassis pair was improperly being switched across the “side link” between the two chassis, taking a sub-optimal path. Once identified, they were able to force the router to re-program its traffic forwarding path to take the direct egress links on both chassis, and by 11-Apr-2016 19:07 UTC normal customer traffic switching resumed.

During this 20-minute period, the switching fabric on one of the line cards of bcr03.wdc01 became congested. As a result, some customer traffic on the back-end private network for servers located behind this router experienced periods of increased latency for traffic being switched through interfaces on that component. Once the condition was manually cleared, network performance was restored to normal levels.
The hardware vendor was engaged to assist with the root cause investigation. While the router configuration and health were being reviewed, the incident recurred multiple times. While the research continued, SoftLayer personnel implemented an automated script to detect the condition and immediately apply the workaround, which significantly shortened the duration of the subsequent recurrences.
The hardware vendor confirmed that the incidents were occurring due to a software bug triggered on the back-end customer router bcr03.wdc01 which caused the sub-optimal forwarding path to become improperly programmed. Network engineers worked with the vendor to determine and validate a non-disruptive method to reset the hardware programming on the router to prevent further recurrences, without having to incur the disruption involved in performing an emergency code upgrade.
The action to gracefully disable and then re-enable each of the redundant uplinks on bcr03.wdc01 was completed successfully on 13-Apr-2016 03:00 UTC. Once completed, there were no further recurrences.

Future Mitigation:
The hardware vendor has confirmed that the potential for this issue to recur is present in the current code deployed on the router and has recommended a code upgrade as a permanent mitigation. Network Engineering management is reviewing that information to determine the best course of action regarding a future code upgrade, since that work would cause an extended disruption to customers behind this router.
Until a final determination is made, the workaround can be implemented again should the incident recur.

Server Density systems are designed to withstand a single failure of anything, be it a server failure or a data center failure. In this case, the data center network degradation classifies as a data center failure. While most failure scenarios (e.g. a server failure) have an automated failover mechanism, a data center failure needs to be escalated to an engineer, who decides whether to switch the primary data center. During this escalation, given the mitigation action by our provider and a fully recovered service after less than 2 hours of degradation spread over a 6-hour period, we decided not to fail over.
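To make that escalation path concrete, the sketch below illustrates the distinction between failures we handle with automated failover and failures that page an engineer for a judgement call. The names and structure are hypothetical and for illustration only, not our actual tooling.

```python
from enum import Enum


class FailureScope(Enum):
    SERVER = "server"            # e.g. a single host stops responding
    DATA_CENTER = "data_center"  # e.g. network degradation across a facility


def handle_failure(scope: FailureScope) -> str:
    """Illustrative policy: automate single-server failover, escalate DC failures."""
    if scope is FailureScope.SERVER:
        # A standby replica takes over without human intervention.
        return "automatic failover to standby"
    # Switching the primary data center is disruptive, so a human weighs the
    # provider's mitigation progress against the cost of failing over.
    return "page on-call engineer to decide on data center failover"


print(handle_failure(FailureScope.DATA_CENTER))
```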

The second incident, on April 13th, was initially thought to be a recurrence of the April 11th incident. It was, however, a complete loss of inter-data-center networking, causing a network split and the loss of our database primaries. Our provider SoftLayer released the following reason for outage:

Background Information:
At SoftLayer we run our own global Internet backbone, which is engineered with multiple levels of equipment, circuit and geographic redundancy across multiple underlying carriers. This allows us to provide continuous operations in the face of a variety of conditions including hardware failures, fiber cuts or carrier outages. We routinely perform maintenance across the backbone to provide additional capacity and functionality to support the needs of our clients. We understand the criticality of the wide area network and as such perform each change under careful change control, which includes remediation procedures. Changes are also implemented incrementally so as to reduce the effect of unanticipated behavior.
One of the techniques used to assure optimized performance on the SoftLayer backbone is to utilize BGP route reflectors to propagate network information between cities and regions. Each of the three regions - Asia, North & South America, and Europe - has its own set of redundant route reflectors which service the backbone routers (BBRs) in that region.

Currently, we are preparing for future enhanced network functionality, which requires an additional software feature - “BGP additional paths” - to be enabled across all BBRs globally. This feature has already been successfully rolled out to all of the routers in the Asia and Europe regions, and we are currently in the process of completing this addition to the routers throughout North & South America.

In the case of this incident, additional configuration to enable the additional paths feature was added to the router bbr01.cs01.lax01 as part of the continued roll-out. That configuration addition triggered a software bug on the route reflectors for North & South America, which caused multiple simultaneous failures of all the route reflectors in the North/South America regions. If one of the route reflectors had remained in service, no impact would have been felt.

Future Mitigation:
We have engaged with the hardware vendor to determine why enabling this feature on an additional BBR resulted in instability on the route reflector processes, especially considering that this type of activity has been performed numerous times throughout the network without any adverse impact. The hardware vendor has confirmed that there is a software bug which can be triggered in some cases due to a timing issue when some of the attached routers have the additional paths feature enabled, while other routers are not configured with this feature.

SoftLayer network engineers have reviewed details about the software bug and the vendor’s recommended remediation. As a result we have decided to perform a code upgrade on the route reflectors in dal03, wdc02, sjc02, tok01, sng02, ams02, lon01 and syd02 before proceeding with the remaining new feature rollout. This will ensure that we can successfully complete the feature rollout without further adverse customer impact.
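To illustrate the redundancy model described in SoftLayer's RFO above, the toy sketch below (illustrative only, with made-up reflector names, and not SoftLayer's actual topology or software) shows why a single route reflector failure is invisible to customers, while the simultaneous failure of every reflector in a region cuts inter-data-center routing for that region:

```python
# Toy model of regional route reflector redundancy: backbone routers in a
# region keep learning inter-data-center routes as long as at least one of
# the region's redundant route reflectors is healthy.
REGION_ROUTE_REFLECTORS = {
    "americas": ["rr1.dal", "rr2.wdc"],  # hypothetical reflector names
    "europe": ["rr1.ams", "rr2.lon"],
    "asia": ["rr1.tok", "rr2.sng"],
}


def inter_dc_routing_available(region: str, failed: set[str]) -> bool:
    """Routes between data centers in a region propagate while any reflector survives."""
    return any(rr not in failed for rr in REGION_ROUTE_REFLECTORS[region])


# One reflector fails: redundancy masks it, no customer impact.
assert inter_dc_routing_available("americas", {"rr1.dal"})

# All reflectors in the region fail at once (as on April 13th): inter-data-center
# routes are lost and the region's data centers are isolated from each other.
assert not inter_dc_routing_available("americas", {"rr1.dal", "rr2.wdc"})
```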

As mentioned earlier, Server Density deploys a single-failure-resilient design. For example, we deploy our MongoDB database servers across 3 different data centers to account for the loss of an entire data center. However, this design does not allow for the loss of more than one data center, and that was exactly the effect of losing all inter-data-center networking. Our database servers lost quorum and failed to elect a primary because they were in fact isolated from each other. As soon as the network was recovered, the expected elections took place, primaries were elected, and the service recovered.
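As an illustration of that design (with hypothetical hostnames and replica set name, not our production configuration), a three-member MongoDB replica set spread across three data centers needs a majority of voting members, two of three here, to elect a primary. With all inter-data-center links down, each member is isolated, no partition holds a majority, and every member remains a secondary until connectivity returns:

```python
# Minimal sketch of a three-data-center replica set, assuming pymongo is
# installed and the (made-up) hostnames below resolve.
from pymongo import MongoClient

config = {
    "_id": "sd-rs0",  # hypothetical replica set name
    "members": [
        {"_id": 0, "host": "db01.wdc.example.com:27017"},  # Washington
        {"_id": 1, "host": "db02.sjc.example.com:27017"},  # San Jose
        {"_id": 2, "host": "db03.tor.example.com:27017"},  # Toronto
    ],
}

# Run once against the first member to initiate the replica set. From then on,
# a primary exists only while 2 of the 3 members can reach each other.
client = MongoClient("mongodb://db01.wdc.example.com:27017", directConnection=True)
client.admin.command("replSetInitiate", config)
```

This is also why the service recovered without manual intervention once connectivity returned: as soon as any two members could see each other again, an election succeeded and a primary was restored.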

Posted Apr 29, 2016 - 12:14 BST

Resolved
We have not seen further occurrences of this issue. Our provider has declared it fixed and released a Reason for Outage, which we'll be reviewing and publishing as part of the full postmortem.
Posted Apr 19, 2016 - 12:50 BST
Update
The original cause of this incident was resolved this morning at 04:00 UTC when SoftLayer network engineers updated the firmware on the affected switches. However, at 06:16 UTC work being performed on a redundant backbone router in the LAX01 PoP caused a high rate of internal routing updates to the US-based route reflectors. As a result, the route reflectors experienced instability in the routing process, which ultimately caused that software component to crash.

Regional traffic within Europe, Asia and India would for the most part not have been affected by this incident. Likewise, traffic within a given city in the US datacenters would have remained unaffected, and most Internet ingress/egress to datacenters in North and South America also continued without notable disruption.

The primary impact was traffic between datacenters, where at least one of the datacenters was located in North or South America. Both public and private network traffic within those regions would have been similarly impacted.

As network engineers normalized the LAX01 router, the high rate of routing message updates ceased, and the route reflectors stabilized without any further manual intervention. Once stabilized, normal routing resumed shortly thereafter.

Since we deploy across multiple datacenters in North America (Washington, San Jose and Toronto), our public and private networking traffic was affected. This caused service unavailability between 06:16 UTC and 06:52 UTC.

We will be working with SoftLayer to understand the root cause of both of these incidents and will publish a postmortem within the next few days.
Posted Apr 13, 2016 - 08:15 BST
Update
We will continue to monitor this issue and work with SoftLayer to get confirmation that networking capability at our datacenters is fully restored.
Posted Apr 13, 2016 - 08:01 BST
Monitoring
At this time we have fully recovered and will also be re-enabling device nodata alerting. Between 06:30 and 06:35, before we disabled this alerting, some nodata alerts may have been sent for devices for which we had not yet restored payload processing. These would have been cleared before 06:40.
Posted Apr 13, 2016 - 07:56 BST
Identified
We've just received confirmation that the SoftLayer network in the US began experiencing routing instability earlier. The primary impact to us is that traffic into / out of our datacenters has experienced a disruption.
At this time we have been able to recover some service. Alerting is restored except for device nodata alerts, which we are keeping off until we are back to normal payload processing.
Posted Apr 13, 2016 - 07:37 BST
Investigating
At this time our payload processing has stalled again. We are investigating whether this occurrence is a result of our provider's network recovery attempts.
Posted Apr 13, 2016 - 07:16 BST
Update
While network engineers in one of our SoftLayer data centers were performing emergency maintenance to correct a switching issue between the two chassis, an errant command was applied. The errant command was corrected shortly afterwards, at which time normal traffic flow was restored.

This command caused the router to switch some traffic through a sub-optimal path for a portion of the aggregate switches downstream of this router. As a result, some of our backend servers experienced increased latency on the private network.

About an hour after this, the increased latency recurred due to a bug in the switch firmware. Network engineers responded rapidly to manually clear the incident and resume normal traffic forwarding. The recurrence caused increased latency on the backend network and slowed down our alert processing.

After the first occurrence, the bug was triggered a further three times, but network engineers were able to clear it within a minute each time.

The hardware vendor has identified the bug which is being triggered, and has recommended a course of action. Network management are reviewing the recommendation to determine the best implementation plan.

Network engineers will continue to monitor closely to ensure rapid remediation for future recurrences. We will provide another update as we have more information.
Posted Apr 12, 2016 - 08:22 BST
Update
The root cause of this incident has been identified as increased network latency on switches that power our internal networking. Our provider is working to restore normal service. Until that happens, we expect further occurrences of this issue and will post updates as they happen.
Posted Apr 12, 2016 - 00:19 BST
Monitoring
We have not yet concluded work on this issue but all systems have been back to normal for the past hour. We'll continue to monitor overnight.
Posted Apr 11, 2016 - 23:23 BST
Identified
Payload processing has been back to normal for the past 30 minutes. We'll continue to investigate the issue.
Posted Apr 11, 2016 - 22:46 BST
Investigating
We are experiencing a delay in payload processing and we are investigating the cause. This will result in alerting delays.
Posted Apr 11, 2016 - 22:18 BST