Please read the following as it could have an impact on some of your customers.
Incident Reference: Gamma Ref-BB28142014
Start Date: 28th August 2014
Start Time: 02:09
Actual Clear Date: 28th August 2014
Actual Clear Time: 22:10
Incident Summary
Loss of broadband connectivity with further impact on some voice services.
Incident Details
Broadband connectivity to our Trafford (North) and Paul St (South) nodes was interrupted by planned maintenance on the BT network.
Once the BT services were restored, the terminating devices for subscribers on Gamma’s network could not recover the lost sessions. This prolonged the outage for the majority of our BB services.
In addition to the failure of services dependent on the BB connectivity, the resulting congestion caused some failed and poor-quality calls on our SIP trunking, Horizon, IB2, CPS & IDA services.
Timeline
02:09 – NOC alerts show loss of connectivity to our PST and TFD nodes.
02:30 – On-call transmission engineers fully engaged in diagnostics.
02:30 – 03:15 – Diagnostics indicate that the majority of BB connectivity has been lost and that initial attempts at recovery are failing.
03:20 – Major Service Outage process invoked.
03:20 – Gamma MSO bridge opened
03:20 – 04:00 – Additional engineering engaged and working on resolution.
04:15 – First customer alert sent, with regular updates following throughout the day.
04:20 – BT Engineering teams join Gamma bridge
04:20 – 05:00 – Gamma and BT engaged in cooperative diagnostics. At this point it started to become clear that there was an issue with the process for retaining subscribers on the network. There was a constant churn of subscribers joining and then dropping again after a 120-second window. BT indicated that planned works at a local exchange had commenced at the same time as the outage, at 02:10.
05:15 – 05:45 – Connectivity begins to return and approximately 25% of subscribers have successfully rejoined the network.
06:00 – Subscriber numbers fall rapidly again and most recovered sessions are dropped.
06:15 – 09:00 – BT perform various tests and remedial works on the network, rerouting traffic across both of their core networks. BT and Gamma review and track individual subscriber ingress/egress through the network.
BT also assist with a review of what happened at 05:45 to cause the partial restoration of sessions. BT can ping the Gamma tunnel termination devices, but they appear unreachable from elsewhere within the BT network. The BT Access Control List at Manchester is removed to see if that assists in resolving the apparent routing issues.
08:50 – 09:10 – Gamma commence a full restart of selected core equipment in the data path. This process is intrusive and is only undertaken in exceptional circumstances. The restarts have no beneficial impact.
09:10 – BT begins a detailed review of the changes made at the local exchange that may have triggered the outage. Reverting to the conditions prior to the change has no impact.
09:15 – Equipment vendors fully engaged and reviewing detailed logs and traces of network activity.
09:30 – 11:30 – At this point the focus of investigation is a routing or IP conflict. As the individual sessions are built through a very large number of routes, extensive work is done to reduce the routing to a smaller, more manageable level (focused on our Trafford node) to allow effective diagnostics. This is complex and must be achieved without further impact to stable data services.
11:35 – After extensive analysis, equipment vendors report that they can find no obvious issues with the core devices handling traffic.
11:51 – The majority of IPstream customers are now stable on the Trafford node.
12:05 – BT revert their changes to the core network, re-introducing redundant paths.
12:19 – BT confirmed they have fully reverted their network to standard topology.
12:25 – Begin re-establishing WBC links at Trafford. Using a route map, we start to allow our terminating equipment to respond to tunnel setups from a small BT subnet, restricting subscription attempts. This process is expanded slowly.
12:43 – Limited number of WBC customers begin to return to service.
12:58 – 15:00 – Subscribers continue to be introduced in a controlled fashion to avoid any losses of existing circuits.
15:00 – Gamma re-establishes the IPstream and WBC links at the North and South nodes (TFD & PST).
15:00 – 17:00 – To alleviate the load on Gamma termination equipment, BT apply an outbound Access Control List (ACL) towards Gamma.
17:40 – The BT ACL proves effective in allowing an increased rate of subscriber reconnects. Gamma introduce a similar process on their own equipment to admit BT subnets in a more controlled fashion and return the network to fully routed status. This proves to be stable, allowing us to reach higher subscriber levels.
19:36 – All host links back up. Core systems stable.
19:55 – 21:45 – Continuing the process of bringing subscribers back online by permitting more subnets in the inbound ACL. Connectivity is managed to ensure that subscribers are fully balanced over the host links.
22:00 – All subnets now permitted. A small number of subscribers had not yet returned to service, but this was expected, as CPE often require a reboot.
23:59 – Final balancing of subscribers across host links carried out; network and subscribers fully stable.
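As an illustration only, the staged re-introduction of subscribers described in the timeline (permitting one BT subnet at a time before widening the allow-list) can be sketched as follows. The subnets, function names, and logic here are hypothetical and are not Gamma's actual configuration:

```python
import ipaddress

# Illustrative subnets only -- not Gamma's or BT's real address space.
bt_subnets = [
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("10.0.1.0/24"),
    ipaddress.ip_network("10.0.2.0/24"),
]

allowed = []  # subnets currently permitted to set up tunnels

def permit_next_subnet():
    """Widen the allow-list by one subnet, mimicking the staged ACL expansion."""
    for net in bt_subnets:
        if net not in allowed:
            allowed.append(net)
            return net
    return None  # every subnet is already permitted

def tunnel_setup_allowed(src_ip):
    """Accept a tunnel setup only if the source address is in a permitted subnet."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in allowed)

permit_next_subnet()  # permit the first subnet only
print(tunnel_setup_allowed("10.0.0.5"))  # subscriber in the permitted subnet
print(tunnel_setup_allowed("10.0.2.5"))  # subscriber in a not-yet-permitted subnet
```

The point of the staged approach is that each widening step admits only a bounded number of new sessions, so the terminating equipment never faces the full reconnect storm at once.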
Corrective Action
After extensive network topology reroutes and detailed diagnostics, subscribers were returned to normal levels by restricting the rate at which connections were re-established, preventing overload of Gamma core network devices.
This process is now built into our edge network devices and, in the unlikely event of a similar failure, will enable a more rapid restoration of subscribers.
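A minimal sketch of the kind of rate restriction described above, assuming a simple token-bucket model (the class, parameters, and figures are illustrative assumptions, not Gamma's implementation):

```python
import time

class ReconnectRateLimiter:
    """Token-bucket limiter: caps the rate at which dropped subscriber
    sessions may re-establish, so a mass reconnect cannot overload the
    terminating equipment. Illustrative sketch only."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # sustained reconnects allowed per second
        self.capacity = burst         # short-term burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        """Permit one reconnect attempt if a token is available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # attempt deferred; subscriber retries later

# 1000 simultaneous reconnect attempts against a small bucket:
limiter = ReconnectRateLimiter(rate_per_sec=100, burst=10)
accepted = sum(1 for _ in range(1000) if limiter.allow())
print(accepted)  # close to the burst size; the excess is deferred
```

Under this model a reconnect storm is spread out over time instead of arriving at once, which is the behaviour the corrective action aims for.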
The resulting congestion in the remainder of the Gamma network caused many reports of impact on voice services. This was addressed through rerouting traffic and increasing bandwidth as required on congested routes. These latter measures will remain in place until a full RCA is completed.
Gamma operates a fully resilient network and has to date successfully redirected traffic between nodes in the event of infrastructure failures with no impact on subscribers.
Gamma’s core termination equipment is rated to carry many more subscribers than are currently active, and consequently this will be one of the main areas of investigation.
Work will also focus on how an external incident was able to impact all elements of our subscriber termination equipment. Extensive load tests will be made within our lab environment in close cooperation with equipment vendors to attempt to reproduce the failure modes experienced.
We will be working with BT to fully understand what part their planned maintenance works played in triggering such a large failure and to ensure that we are adequately prepared should there be similar works.
We will also be closely reviewing the handling of subscriber restoration rates within our network in the event of termination failures, and the larger-than-expected signaling levels experienced.
This work will be detailed and exhaustive and we expect to have results within the next two weeks.
Next Update: Final RFO when the work above is completed.