Between April 11th 18:47 UTC and April 13th 06:13 UTC we experienced two distinct incidents on our internal network.
The first incident, on April 11th, caused high latency between a significant number of our servers, which degraded our service, delaying our processing of payloads and, consequently, our alerting. Our provider SoftLayer released the following reason for outage:
At 11-Apr-2016 18:47 UTC, the SoftLayer Network Operations Center began receiving reports of increased traffic latency to hosts behind bcr03.wdc01.
Network Engineers began to investigate and discovered that some of the traffic egressing the router chassis pair was improperly being switched across the “side link” between the two chassis, taking a sub-optimal path. Once identified, they were able to force the router to re-program its traffic forwarding path to take the direct egress links on both chassis, and by 11-Apr-2016 19:07 UTC normal customer traffic switching resumed.
During this 20-minute period, the switching fabric on one of the line cards of bcr03.wdc01 became congested. As a result, some customer traffic on the back-end private network for servers located behind this router experienced periods of increased latency for traffic being switched through interfaces on that component. Once the condition was manually cleared, network performance was restored to normal levels.
The hardware vendor was engaged to assist with the root cause investigation. While the router configuration and health were being reviewed, the incident recurred multiple times. While the research continued, SoftLayer personnel implemented an automated script to detect the condition and immediately apply the workaround, which significantly shortened the duration of the subsequent recurrences.
The hardware vendor confirmed that the incidents were occurring due to a software bug triggered on the back-end customer router bcr03.wdc01, which caused the sub-optimal forwarding path to become improperly programmed. Network engineers worked with the vendor to determine and validate a non-disruptive method to reset the hardware programming on the router to prevent further recurrences, without having to incur the disruption involved in performing an emergency code upgrade.
The action to gracefully disable and then re-enable each of the redundant uplinks on bcr03.wdc01 was completed successfully on 13-Apr-2016 at 03:00 UTC. Once completed, there were no further recurrences.
Future Mitigation:
The hardware vendor has confirmed that the potential for this issue recurring is present in the current code deployed on the router and has recommended a code upgrade as a permanent mitigation. Network Engineering management is reviewing that information to determine the best course of action regarding a future code upgrade, since that work would cause an extended disruption to customers behind this router.
Until a final determination is made, the workaround can be implemented again should the incident recur.
Server Density systems are designed to withstand a single failure of anything, be it a server failure or a data center failure. In this case, a data center network degradation classifies as a data center failure. Although most failure scenarios (e.g. a server failure) have an automated failover mechanism, a data center failure needs to be escalated to an engineer, who decides whether to switch the primary data center. During this escalation, in view of the mitigation action taken by our provider and the fact that the service had fully recovered after less than 2 hours of degradation over a 6 hour period, we decided not to fail over.
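The split between the automated path and the manual escalation path can be illustrated with a minimal sketch. The failure classification and the handle_failure helper below are hypothetical, invented for illustration, and not our actual tooling:

```python
from enum import Enum

class FailureScope(Enum):
    SERVER = "server"            # e.g. a single host becomes unreachable
    DATA_CENTER = "data_center"  # e.g. network degradation affecting a whole facility

def handle_failure(scope: FailureScope) -> str:
    if scope is FailureScope.SERVER:
        # Automated path: the failed server is replaced or bypassed without human input.
        return "automated failover"
    # Data center failures are escalated instead of handled automatically:
    # switching the primary data center is disruptive, and may be unnecessary
    # if the provider's own mitigation restores service first.
    return "escalate to on-call engineer"

if __name__ == "__main__":
    print(handle_failure(FailureScope.SERVER))       # -> automated failover
    print(handle_failure(FailureScope.DATA_CENTER))  # -> escalate to on-call engineer
```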
The second incident, on April 13th, was initially thought to be a recurrence of the April 11th incident. It was, however, a complete loss of inter-data-center networking, causing a network split and the loss of our database primaries. Our provider SoftLayer released the following reason for outage:
Background Information:
At SoftLayer we run our own global Internet provider network backbone, which is engineered with multiple levels of equipment, circuit and geographic redundancy across multiple underlying carriers, allowing us to provide continuous operations in the face of a variety of conditions including hardware failures, fiber cuts or carrier outages. We routinely perform maintenance across the backbone to provide additional capacity and functionality to support the needs of our clients. We understand the criticality of the wide area network and as such perform each change under careful change control, which includes remediation procedures. Changes are also implemented incrementally so as to reduce the effect of unanticipated behavior.
One of the techniques used to assure optimized performance on the SoftLayer backbone is to utilize BGP route reflectors to propagate network information between cities and regions. Each of the three regions - Asia, North & South America, and Europe - has its own set of redundant route reflectors which service the backbone routers (BBRs) in that region.
Currently, we are preparing for future enhanced network functionality, which requires an additional software feature - “BGP additional paths” - to be enabled across all BBRs globally. This feature has already been successfully rolled out to all of the routers in the Asia and Europe regions, and we are currently in the process of completing this addition to the routers throughout North & South America.
In the case of this incident, additional configuration to enable the additional paths feature was added to the router bbr01.cs01.lax01 as part of the continued roll-out. That configuration addition triggered a software bug on the route reflectors for North & South America, which caused multiple simultaneous failures of all the route reflectors in the North/South America regions. If one of the route reflectors had remained in service, no impact would have been felt.
Future Mitigation:
We have engaged with the hardware vendor to determine why enabling this feature on an additional BBR resulted in instability on the route reflector processes, especially considering that this type of activity has been performed numerous times throughout the network without any adverse impact. The hardware vendor has confirmed that there is a software bug which can be triggered in some cases due to a timing issue when some of the attached routers have the additional paths feature enabled, while other routers
are not configured with this feature.
SoftLayer network engineers have reviewed details about the software bug and the vendor’s recommended remediation. As a result we have decided to perform a code upgrade on the route reflectors in dal03, wdc02, sjc02, tok01, sng02, ams02, lon01 and syd02 before proceeding with the remaining new feature rollout. This will ensure that we can successfully complete the feature rollout without further adverse customer impact.
As mentioned earlier, Server Density deploys a single-failure-resilient design. For example, we deploy our MongoDB database servers across 3 different data centers to account for the loss of an entire data center. However, this design does not allow for the loss of more than one data center, and that was exactly the side effect of losing all inter-data-center networking. It caused our database servers to lose quorum and fail to elect a primary, as they were in fact isolated from each other. As soon as the network recovered, the expected elections took place, primaries were elected, and the service recovered.
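To make that topology concrete, here is a minimal sketch of that kind of replica set using PyMongo; the hostnames and replica set name are assumptions for illustration, not our actual configuration. With one voting member per data center, losing any single data center still leaves a 2-of-3 majority able to elect a primary, but a full inter-data-center split leaves each member with only its own vote, so no primary can be elected until connectivity returns.

```python
from pymongo import MongoClient

# One voting member per data center (hypothetical hostnames).
config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-dc-a.example.internal:27017"},  # data center A
        {"_id": 1, "host": "mongo-dc-b.example.internal:27017"},  # data center B
        {"_id": 2, "host": "mongo-dc-c.example.internal:27017"},  # data center C
    ],
}

# Connect directly to one member and initiate the replica set.
client = MongoClient("mongo-dc-a.example.internal", 27017, directConnection=True)
client.admin.command("replSetInitiate", config)
```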