EVENT DRIVEN AUTO REMEDIATION
Challenge: Responding to Every Network Problem
- A user complains that an application is not working, only to learn that the cause is a connectivity issue. The user must then open a service ticket, which is how the company first learns that a critical outbound interface has gone down. The service team moves the ticket to the network team as a priority 1 for immediate action, regardless of the time of day.
- The network team, often woken in the middle of the night to deal with the issue, must then troubleshoot and run diagnostics to discover which interface is down. Once it is identified, the team brings the interface back up, verifies the connection, and updates and closes the service ticket.
- These user complaints pile up and overwhelm the service ticket system. Meanwhile, the network team must treat every event as a priority 1 until it is correctly identified, even if a simple interface-up command would resolve the issue. This increases stress for the network team and raises employment costs through overtime spent on issues that could wait until the morning.
The Conventional Workflow Approach
A user recognizes that a service has gone down and creates a ServiceNow ticket. Service Ops assigns the ticket to NetOps with a Priority 1.
NetOps team members now manually start diagnostics, discovering that an interface is down. Remediation action is performed to bring up the interface.
If the remediation succeeds, the NetOps team updates the ticket; otherwise it continues running diagnostics to discover why the interface went down, and the ticket remains a priority 1 task. The ServiceNow record is then updated and closed.
Orchestral.ai's Symphony Solution
Existing IT tools and practices are maintained.
To begin, the Orchestral Data Bot, a multi-vendor data collector, collects statistical data from all infrastructure endpoints and publishes it to Maestro’s infrastructure telemetry data store. The Data Bot also collects all syslog information from network switches, which enables Maestro to recognize immediately when a critical outbound interface goes down. Once the outage is recognized, Maestro triggers a Composer auto_remediation_workflow without delay. Composer then executes the following steps:
- If the router is accessible: informs the operations teams about the outage through omni-communicational ChatOps and indicates the start of the auto_remediation_workflow.
- Collects the show tech output from the router before and after the remediation action, then zips the two files as the incident artifact.
- Opens a service ticket with priority 5 on the ticketing system and attaches the troubleshooting artifact for further analysis.
- Lastly, informs the Ops team through ChatOps of the newly created incident and its number for analysis.
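The steps above can be sketched in Python. This is only an illustrative outline under stated assumptions, not Orchestral's implementation: the `Actions` callables (`is_reachable`, `notify`, `show_tech`, `bring_up`, `open_ticket`) are hypothetical stand-ins for the ChatOps, device, and ticketing integrations, and the syslog pattern assumes Cisco IOS-style `%LINK-3-UPDOWN` messages, which other vendors format differently. The source does not describe what happens when the router is unreachable, so the sketch simply stops in that case.

```python
import io
import re
import zipfile
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: Cisco IOS-style link messages; adjust the pattern per vendor.
LINK_DOWN = re.compile(
    r"%LINK-\d-UPDOWN: Interface ([^,\s]+), changed state to down"
)


def parse_link_down(syslog_line: str) -> Optional[str]:
    """Return the interface name if the line reports a link-down event."""
    match = LINK_DOWN.search(syslog_line)
    return match.group(1) if match else None


@dataclass
class Actions:
    """Hypothetical integration points, injected so the flow is testable."""
    is_reachable: Callable[[str], bool]       # router -> accessible?
    notify: Callable[[str], None]             # post a ChatOps message
    show_tech: Callable[[str], str]           # router -> show tech output
    bring_up: Callable[[str, str], None]      # router, interface -> remediate
    open_ticket: Callable[[int, bytes], str]  # priority, artifact -> ticket id


def zip_artifact(before: str, after: str) -> bytes:
    """Zip the before/after show tech captures into one in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("show_tech_before.txt", before)
        archive.writestr("show_tech_after.txt", after)
    return buffer.getvalue()


def auto_remediation_workflow(
    router: str, interface: str, act: Actions
) -> Optional[str]:
    """Run the remediation steps in order; return the ticket id, if any."""
    if not act.is_reachable(router):
        return None  # Unreachable case is not covered by the source text.
    act.notify(
        f"{interface} on {router} is down; starting auto_remediation_workflow"
    )
    before = act.show_tech(router)
    act.bring_up(router, interface)
    after = act.show_tech(router)
    ticket = act.open_ticket(5, zip_artifact(before, after))
    act.notify(f"Incident {ticket} created for analysis")
    return ticket
```

Passing the integrations in as callables keeps the ordering of the steps separate from any particular vendor, chat platform, or ticketing system, which matches the multi-vendor framing of the workflow.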