SRE Managers: Building Automation to Prevent Problem Recurrence and Automate Response to Service Issues
RigD Platform for Collaborative Automation Part 1
Our users know RigD as a Slack Bot. You talk to it in Slack and it does stuff for you. It's an interesting Bot that is like a command line on steroids, with extra capabilities that come from Natural Language Understanding and AI, but more on that in the fourth part of this series. If you work in DevOps, as an SRE, in IT Operations, or in major incident management, I don’t need to tell you how important and visible business-impacting incidents are. You and your team have too much work to do, not enough automation, not enough collaboration, and generally less than optimal learning cycles. If you are an SRE manager, you were hired to lead teams that will build automation to prevent problem recurrence and ultimately automate the response to service issues. The RigD platform can help you further that goal.
Use Cases for Collaborative Automation
The RigD platform, through the Slack App, can help Support teams collaborate better and faster to improve the on-call experience:
- Reduce alerting noise, route to the right staff immediately, save time with escalations and other incident lifecycle actions
- Use the power of Slack to collaborate and work issues, expedite problem solving
- Run triage flows to guide the staff and automate data collection
- Accelerate ticket resolution times and get updates in real-time
For Admins and SREs, RigD can be instrumental in improving your tooling and processes so that incidents and problems are resolved faster next time:
- Respond quickly with higher-quality issue remediation
- Reduce recurring issues and repetitive escalations
- Provide your teams with the tooling to help them be more efficient and solve problems faster
- Interact with your favorite tools to both get data and push status
For major incidents, RigD can help the Incident Commander move the team through incidents faster so that service is restored sooner:
- Control the incident lifecycle within your collaboration platform
- Make data-based decisions with confidence and team buy-in
- Have confidence in delegation and the team process and learning
IT Operations and Line of Business Application Owners can get the right people and software assets working to identify and remediate issues fast:
- Focus your teams on what matters most
- Mobilize the right people to respond with agility and collaboration through the lifecycle
- Reduce response and resolution times to get customers back up quickly
- Reduce efforts in getting status and identify areas for team improvement
Solution Process Steps for SREs
No matter the application or the underlying infrastructure (Kubernetes, VMs, AWS, GCP or Azure), DevOps and IT Operations have a support model that can be deconstructed into five steps: engagement, verification, investigation, resolution and improvement. Whether the domain is app, network, storage, infrastructure or any number of other technical areas, we have observed that teams focused on best practices generally follow this process. The steps may have different names and may command a larger calorie expenditure from one team or company to the next, but aligning the work activities to these five steps is a useful construct for analyzing the collaboration needed.
What can collaboration do to help in these steps? Using a tool such as Slack can help everyone stay on the same page as key information is discovered, shared and commented on, and as next steps are determined.
- Engage – Here the incident (or problem) arrives either through a customer call or through automatic alerts coming in from any number of monitoring tools and sources. Depending on the data source or the infrastructure component involved, the incident management systems will need to get started on what they do best: managing the incident lifecycle. Having this all start in Slack, visible to key players and the team, including spinning up a dedicated Slack channel for each incident, is a great way for everyone to get on the same page about what just happened. Creating an incident through a simple Slack command and updating the incident information in Slack is a great time saver. Using the RigD work concept can put a “container” around all of the Slack information, making the postmortem easier (more on that later).
- Verify – We have all seen false positives coming in through the monitoring and logging systems, as well as questionable issues being identified. This next process step involves verifying the information and context about an incident to ensure that the right information and importance are ascribed to the incident or problem at hand. Often, we can figure out which team needs to be engaged and then find out who is on call via simple commands in Slack.
- Investigate – By now we have brought in the key folks on call and they have started the detailed investigation. Bringing key monitoring or log data into Slack for review, rather than cutting and pasting it, can save a lot of time, and the resulting transparency makes this phase a key learning opportunity for junior team members. During the investigation, keeping key people up to date, in Slack or by other means, is critical to managing the incident response.
- Resolve – Once all the data is collected and the necessary conversations, tribal know-how, and key infrastructure or service data have been reviewed, the team moves into the resolution phase. This can be as simple as a command in the AWS console, or as complicated as running an automation sequence in the tool of choice, perhaps even from Slack.
- Improve – Thankfully the incident or problem got resolved quickly and the virtual standup team has been disbanded. But this is when the fun begins. The owner of the service or incident will undoubtedly want a postmortem. In fact, she or he will bug the key players until they produce one. We all know that another incident or problem will show up with the sunrise or sunset, and people will naturally forget the work they promised to do. This is where collaborative automation can help, by providing reminders for the key people to complete the postmortem. The RigD Postmortem Incident Work Digest is a great starting point for the process and data flow. Additionally, the improve step will surface key learnings around the triggers of the incident, the technical issues that contributed, and the people issues that either sped up or delayed the response.
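To make the five steps above concrete, here is a minimal sketch in Python of an incident that moves through the phases while collecting notes into a single "container" that can later seed a postmortem digest. This is not RigD's API or data model; the names `Phase`, `Incident`, `note`, `advance` and `digest` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The five support-model steps, in order (names are illustrative)."""
    ENGAGE = 1
    VERIFY = 2
    INVESTIGATE = 3
    RESOLVE = 4
    IMPROVE = 5


@dataclass
class Incident:
    """Hypothetical container for everything said about one incident."""
    title: str
    phase: Phase = Phase.ENGAGE
    timeline: list = field(default_factory=list)

    def note(self, author, text):
        # Record the note together with the phase it was made in.
        self.timeline.append((self.phase, author, text))

    def advance(self):
        # Move to the next phase; the final phase has no successor.
        if self.phase is Phase.IMPROVE:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase.value + 1)
        return self.phase

    def digest(self):
        # Group notes by phase to give the postmortem a starting point.
        lines = [f"Postmortem digest: {self.title}"]
        for phase in Phase:
            notes = [(a, t) for p, a, t in self.timeline if p is phase]
            if notes:
                lines.append(f"{phase.name.title()}:")
                lines.extend(f"  {a}: {t}" for a, t in notes)
        return "\n".join(lines)


if __name__ == "__main__":
    incident = Incident("checkout API latency")
    incident.note("alice", "alert received from monitoring")
    incident.advance()
    incident.note("bob", "confirmed, not a false positive")
    print(incident.digest())
```

Keeping the phase alongside each note is the design choice that matters here: when the postmortem starts, the discussion is already organized by step instead of having to be reconstructed from a chat history.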